Sub_sampling Paired_end reads in Fastq.gz format.
1
0
Entering edit mode
6.3 years ago
XBria ▴ 90

Hi,

Following my last post (to down-sample paired-end read data).

-Do I have to align the whole data on the genome and then filter those regions I will ? (costs much time !!)

-Is the following command right ? samtools view -b ¨chr1¨ > ERR188044_1_chr1 (version 0.01.19) -Does the same go for the reverse read file too ?

-Do I have to determine the start and end position too ? or just writing chr1 is enough ? (in manual I see only the name of the chromosome is also enough, http://www.htslib.org/doc/samtools-0.1.19.html)

-In combination of two or more regions how may I write the command ? for example combination of chr1 and chr2 ?

-How may I say how much percent of the whole data is filtered ?

Is there anyone who could please answer all these question ? Thanks.

RNA-Seq paired-end • 1.7k views
ADD COMMENT
0
Entering edit mode

Where is the link to your last post? Your questions are not very clear to me. But if you are interested in downsampling paired-end fasta, you may try seqtk sample command https://github.com/lh3/seqtk

ADD REPLY
0
Entering edit mode

How to down-sample a full data please have a look at the link

ADD REPLY
0
Entering edit mode

Please upgrade to latest samtools (if you are still using v. 0.1.19). There is no reason to continue using that ancient version.

I am not sure why you are sub-sampling aligned data when you have the original dataset available? As @Devon indicated in the last thread using raw reads (and adding more of them step-wise) should be the way to go.

ADD REPLY
0
Entering edit mode

Dear Genomax,

The reason why I am using the old version is that I have to work with the same version of software packages used in New Tuxedo paper (Hisat, StringTie,Ballgown).

Please have a look at theis link https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5032908/

in the Overview of the protocol section, look up please the chromosome X, I have to do the same but in a bit larger than chromosome X (maybe X and 1, and the next time X and 1 and 2 chromosomes). I have to repeat it a couple of times and enlarge the files. Then the mapping is done on the larger files too. in the end results will be analyzed.

Please let me know if you have any advice.

Thanks.

ADD REPLY
0
Entering edit mode

From tuxedo paper:

version 0.1.19 or later

You can use the latest version (see the or later part). By the time you are done with alignments samtools is not going to make any changes to the alignment results.

Reads that they have provided in the download from tuxedo paper are only a subset selected that map to chrX. What do you hope to achieve by doing this for additional chromosomes? These example reads should not map significantly to any new chromosomes you add.

I have to repeat it a couple of times and enlarge the files.

How are you going to do that. Align the full original data and extract reads from chr 1 then 2 etc? It is not clear what you are trying to do, at least to me.

ADD REPLY
0
Entering edit mode

computational time is what I need to observe. To see how aligner x performs in comparison with another aligner.

ADD REPLY
2
Entering edit mode

This would not be the way to do it then.

You could sub-sample original full fastq data files and then align those subsets (n, 2n, 3n etc) against entire genome/a chromosome or whatever you choose as the reference. That said, computational time is not a good criteria for comparison since you are assuming that every aligner is going to work equally well (not likely) and the defaults are identical across all of them (again not the case for sure).

See this for more.

ADD REPLY
0
Entering edit mode

Thanks Genomax.

Your comment looks very helpful.

ADD REPLY
2
Entering edit mode
6.3 years ago

Do I have to align the whole data on the genome and then filter those regions I will ? (costs much time !!)

I guess you have to, if you want to retain only the reads that map on chromosome 1. If we knew the positions already, why would we even map them?

Is the following command right ? samtools view -b ¨chr1¨ > ERR188044_1_chr1 (version 0.01.19) -Does the same go for the reverse read file too ?

This command selects the bam entries that map on chromosome 1. Beware that -b will generate a bam output (compressed) without the header (that requires the -h option); the header is important for following usages of bam files, so keep it.

To get the reads, you need to extract them from the bam file. Since I see you're not very expert in this, I advice you against doing this yourself, try instead with bedtools bamtofastq. There you can also specify to separate the output in read_1 and read_2 files. Beware that the bamtofastq module requires the reads to be sorted by name (samtools sort -n).

Do I have to determine the start and end position too ? or just writing chr1 is enough ? (in manual I see only the name of the chromosome is also enough, http://www.htslib.org/doc/samtools-0.1.19.html)

Just the chromosome is enough, but I suppose you have to index your bam file (samtools index) and you will only be available to index it if it is sorted by genome coordinate (samtools sort). This is not the same sort as before.

In combination of two or more regions how may I write the command ? for example combination of chr1 and chr2 ?

Why not making a bed file with the coordinates you want to retain, to use with the -L options of samtools view instead?

How may I say how much percent of the whole data is filtered ?

samtools stats will tell you, among a lot of things, the number of reads (and of pairs) that you have in your bam file. If you run samtools stats for both the unfiltered and the filtered files, simply dividing those two numbers will tell you this bit.

ADD COMMENT
0
Entering edit mode

Thanks Macspider for such a comprehensive answer :) That is what I needed.

ADD REPLY
0
Entering edit mode

Dear Macspider, Can YOU please answer my two last questions that is not answered yet in the following link ? I need to finish the project. Please help me understand it. Thanks, C: Parameter optimization STAR

ADD REPLY
0
Entering edit mode

I read the whole thread now, but I have to say I don't think I can add anything. What @genomax and @wouterdecoster have told you is what I would have said myself!

ADD REPLY

Login before adding your answer.

Traffic: 2885 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6