Question

Sub_sampling Paired_end reads in Fastq.gz format.

0

Entering edit mode

6.3 years ago

XBria ▴ 90

Hi,

Following my last post (to down-sample paired-end read data).

-Do I have to align the whole data on the genome and then filter those regions I will ? (costs much time !!)

-Is the following command right ? samtools view -b ¨chr1¨ > ERR188044_1_chr1 (version 0.01.19) -Does the same go for the reverse read file too ?

-Do I have to determine the start and end position too ? or just writing chr1 is enough ? (in manual I see only the name of the chromosome is also enough, http://www.htslib.org/doc/samtools-0.1.19.html)

-In combination of two or more regions how may I write the command ? for example combination of chr1 and chr2 ?

-How may I say how much percent of the whole data is filtered ?

Is there anyone who could please answer all these question ? Thanks.

RNA-Seq paired-end • 1.7k views

ADD COMMENT • link updated 6.3 years ago by Matteo Schiavinato ★ 3.6k • written 6.3 years ago by XBria ▴ 90

0

Entering edit mode

Where is the link to your last post? Your questions are not very clear to me. But if you are interested in downsampling paired-end fasta, you may try seqtk sample command https://github.com/lh3/seqtk

ADD REPLY • link 6.3 years ago by Santosh Anand 5.7k

0

Entering edit mode

How to down-sample a full data please have a look at the link

ADD REPLY • link 6.3 years ago by XBria ▴ 90

0

Entering edit mode

Please upgrade to latest samtools (if you are still using v. 0.1.19). There is no reason to continue using that ancient version.

I am not sure why you are sub-sampling aligned data when you have the original dataset available? As @Devon indicated in the last thread using raw reads (and adding more of them step-wise) should be the way to go.

ADD REPLY • link 6.3 years ago by GenoMax 141k

0

Entering edit mode

Dear Genomax,

The reason why I am using the old version is that I have to work with the same version of software packages used in New Tuxedo paper (Hisat, StringTie,Ballgown).

Please have a look at theis link https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5032908/

in the Overview of the protocol section, look up please the chromosome X, I have to do the same but in a bit larger than chromosome X (maybe X and 1, and the next time X and 1 and 2 chromosomes). I have to repeat it a couple of times and enlarge the files. Then the mapping is done on the larger files too. in the end results will be analyzed.

Please let me know if you have any advice.

Thanks.

ADD REPLY • link 6.3 years ago by XBria ▴ 90

0

Entering edit mode

From tuxedo paper:

version 0.1.19 or later

You can use the latest version (see the or later part). By the time you are done with alignments samtools is not going to make any changes to the alignment results.

Reads that they have provided in the download from tuxedo paper are only a subset selected that map to chrX. What do you hope to achieve by doing this for additional chromosomes? These example reads should not map significantly to any new chromosomes you add.

I have to repeat it a couple of times and enlarge the files.

How are you going to do that. Align the full original data and extract reads from chr 1 then 2 etc? It is not clear what you are trying to do, at least to me.

ADD REPLY • link 6.3 years ago by GenoMax 141k

0

Entering edit mode

computational time is what I need to observe. To see how aligner x performs in comparison with another aligner.

ADD REPLY • link 6.3 years ago by XBria ▴ 90

2

Entering edit mode

This would not be the way to do it then.

You could sub-sample original full fastq data files and then align those subsets (n, 2n, 3n etc) against entire genome/a chromosome or whatever you choose as the reference. That said, computational time is not a good criteria for comparison since you are assuming that every aligner is going to work equally well (not likely) and the defaults are identical across all of them (again not the case for sure).

See this for more.

ADD REPLY • link 6.3 years ago by GenoMax 141k

0

Entering edit mode

Thanks Genomax.

Your comment looks very helpful.

ADD REPLY • link 6.3 years ago by XBria ▴ 90

score 2 · Accepted Answer · 2018-01-17

Do I have to align the whole data on the genome and then filter those regions I will ? (costs much time !!)

I guess you have to, if you want to retain only the reads that map on chromosome 1. If we knew the positions already, why would we even map them?

Is the following command right ? samtools view -b ¨chr1¨ > ERR188044_1_chr1 (version 0.01.19) -Does the same go for the reverse read file too ?

This command selects the bam entries that map on chromosome 1. Beware that -b will generate a bam output (compressed) without the header (that requires the -h option); the header is important for following usages of bam files, so keep it.

To get the reads, you need to extract them from the bam file. Since I see you're not very expert in this, I advice you against doing this yourself, try instead with bedtools bamtofastq. There you can also specify to separate the output in read_1 and read_2 files. Beware that the bamtofastq module requires the reads to be sorted by name (samtools sort -n).

Do I have to determine the start and end position too ? or just writing chr1 is enough ? (in manual I see only the name of the chromosome is also enough, http://www.htslib.org/doc/samtools-0.1.19.html)

Just the chromosome is enough, but I suppose you have to index your bam file (samtools index) and you will only be available to index it if it is sorted by genome coordinate (samtools sort). This is not the same sort as before.

In combination of two or more regions how may I write the command ? for example combination of chr1 and chr2 ?

Why not making a bed file with the coordinates you want to retain, to use with the -L options of samtools view instead?

How may I say how much percent of the whole data is filtered ?

samtools stats will tell you, among a lot of things, the number of reads (and of pairs) that you have in your bam file. If you run samtools stats for both the unfiltered and the filtered files, simply dividing those two numbers will tell you this bit.