How to produce a table with the number of copies per unique sequence? (Is this possible?)
4.9 years ago
angelaparody ▴ 100

Hi,

First of all, I am not a bioinformatician or computational person; I am a molecular biologist. I have some fastq.gz files, and the FastQC report I generated says that I have lots of overrepresented sequences (> 50%!). This is probably due to the nature of the genome: there is no reference genome, and my guess is that the restriction enzyme used has favoured the sequencing of repetitive elements.

What I am interested in is knowing how many unique sequences/reads have a certain number of copies (coverage), so I can see how many unique reads have a moderate copy number (20-50 copies). Any idea how to get this information? Would it be possible with a command? Could I produce a txt file with two columns, one with the number of unique sequences and a second with the number of copies of those unique sequences? In other words, what I am trying to get is the distribution of copy numbers across unique sequences.

Thanks in advance,
Ángela

fastq • 803 views

Please take a look at these blog posts from the authors of FastQC.

Duplication (https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/ )
Positional bias ( https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/ )

You could de-duplicate this data if you want to count copies of reads with identical sequences (see: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files). You would use the program like this:

clumpify.sh in=file.fq out=deduped.fq dedupe addcount

Reads that absorbed duplicates will have "copies=X" appended to the end of the fastq header to indicate how many reads they represent (including themselves, so the minimum you would see is 2).
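
If you then want the two-column table you described, one way (a sketch, not tested on your data) is to pull those copies= tags out of the header lines of deduped.fq and tally them. Headers without a tag correspond to sequences seen only once, and the output file name here (copy_table.txt) is just illustrative:

# extract the copy number from each header line (every 4th line of the FASTQ),
# defaulting to 1 when no copies= tag is present, then tally how many unique
# sequences share each copy number
awk 'NR % 4 == 1 { if (match($0, /copies=[0-9]+/)) print substr($0, RSTART+7, RLENGTH-7); else print 1 }' deduped.fq \
  | sort -n | uniq -c | awk '{ print $2 "\t" $1 }' > copy_table.txt

The first column of copy_table.txt is the copy number and the second is the number of unique sequences with that copy number, so you can read off how many unique sequences fall in your 20-50 copy range. This assumes the standard 4-line FASTQ layout and that the copies= tag is formatted exactly as above; adjust the parsing if your headers differ.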
