Number of reference sequence contigs in sam/bam header does not match the number in my fasta
7.1 years ago

I created a masked de novo assembly FASTA and aligned my FASTQ reads against it. While building the Bowtie 2 reference index for the assembly, I noticed that many sequences were reported as all N's (662,820 to be exact), which I expected from the masking. After alignment I grepped for the @SQ lines in the SAM header, which should tell me how many contigs are in the reference; there were 2,118,137 listed. Finally, I grepped for the total number of sequences in the reference FASTA file, but there were only 2,449,547, which is fewer than the 2,780,957 I would get by adding the all-N sequences to the header count.

Is there any reason why these numbers wouldn't add up? Where does the number of contigs in the SAM header actually come from?

fasta sam bam contig bowtie2
7.1 years ago

The @SQ lines in the header are written by the aligner, which takes them from the sequences in its index rather than directly from your FASTA. Possibly the aligner is ignoring sequences under a certain length, or all sequences that are entirely N. I suggest you skip masking/filtering, or record how many sequences you filtered, and try again.
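
One way to check this is to count everything directly from the files instead of relying on the index builder's messages. The sketch below is only an illustration, not anything Bowtie 2 itself provides: the function names and the file names `assembly.masked.fa` and `aligned.sam` are made up, and it assumes an uncompressed FASTA and a text SAM file. For what it's worth, the FASTA count in the question exceeds the header count by 2,449,547 - 2,118,137 = 331,410, which is exactly half of 662,820. That would be consistent with each all-N sequence being reported twice during index building, but that is a guess that the counts below could confirm or rule out.

```python
from typing import Iterable, Tuple


def count_fasta_records(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total records, records whose sequence contains no base other than N)."""
    total = 0
    all_n = 0
    in_record = False
    only_n = True  # an empty record also counts as "no non-N base"
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if in_record and only_n:
                all_n += 1
            total += 1
            in_record = True
            only_n = True
        elif line and in_record:
            # Anything left after stripping N/n means a real base is present.
            if line.upper().strip("N"):
                only_n = False
    if in_record and only_n:
        all_n += 1
    return total, all_n


def count_sq_lines(lines: Iterable[str]) -> int:
    """Count @SQ header lines, stopping at the first alignment record."""
    n = 0
    for line in lines:
        if not line.startswith("@"):
            break  # header is over; no need to read the alignments
        if line.startswith("@SQ\t"):
            n += 1
    return n


if __name__ == "__main__":
    # Hypothetical file names; replace with your own paths.
    with open("assembly.masked.fa") as fa:
        total, all_n = count_fasta_records(fa)
    with open("aligned.sam") as sam:
        in_header = count_sq_lines(sam)
    print(f"FASTA records:        {total}")
    print(f"all-N FASTA records:  {all_n}")
    print(f"@SQ lines in header:  {in_header}")
    print(f"FASTA minus all-N:    {total - all_n}")
```

If "FASTA minus all-N" equals the number of @SQ lines, the all-N sequences alone explain the gap. If it does not, something else (for example a length cutoff) is also dropping sequences.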
