Is They Any Software Or Technique To Find The Number Of Reads Mapped To Each Contig After Aligment
7
3
Entering edit mode
10.3 years ago
bambus0725 ▴ 50

HI,

I work with meta-transcriptomics paired end data sequenced using Illumina hiseq technology.I performed denovo-assembly using Trinity on RNA-seq(separately on forward and reverse) data followed by mapping with bowtie2.Now,I am trying to count the reads that were mapped to each contig from the generated sam output file after mapping as it is very ambiguous with nearly 11,000,000 reads.

Just to avoid confusion I extracted only 2 columns from the sam output file that I am interested in

                   query                          contig
HWI-ST365:262:C0RY1ACXX:6:1114:13971:74078        comp59482_c0_seq1
HWI-ST365:262:C0RY1ACXX:6:2102:6548:15712         comp4933_c6_seq1
HWI-ST365:262:C0RY1ACXX:6:2102:6548:15712         comp5103_c0_seq1
HWI-ST365:262:C0RY1ACXX:6:2102:6548:15712         comp5696_c0_seq1
HWI-ST365:262:C0RY1ACXX:6:2102:6548:15712         comp5503_c0_seq1
HWI-ST365:262:C0RY1ACXX:6:2102:6548:15712         comp6262_c0_seq1
HWI-ST365:262:C0RY1ACXX:6:2102:6548:15712         comp5296_c0_seq1
HWI-ST365:262:C0RY1ACXX:6:2102:6548:15712         comp40032_c0_seq1
HWI-ST365:262:C0RY1ACXX:6:2102:6548:15712         comp4933_c6_seq4
HWI-ST365:262:C0RY1ACXX:6:2102:6548:15712         comp11733_c0_seq1
HWI-ST365:262:C0RY1ACXX:6:2102:6548:15712         comp5068_c0_seq1
HWI-ST365:262:C0RY1ACXX:6:2102:6548:15712         comp22661_c0_seq1

Thank you in advance

mapping RNA-seq • 9.6k views
ADD COMMENT
1
Entering edit mode
10.3 years ago
  1. Convert your SAM file to BAM file.
  2. Sort the BAM file using chromosomal coordinates.
  3. Then try: samtools view -F 4 Input.bam | cut -f3 | uniq -c

You can use samtools (http://samtools.sourceforge.net/samtools.shtml)

Or you can try awk command on your SAM file:

awk '{if ($5 > 1)cnt[$3]++}END{for (x in cnt){print x,cnt[x]}}'  Input.sam
ADD COMMENT
0
Entering edit mode

i think, that if the read is mapped twice to same contig, it will be counted two times

ADD REPLY
0
Entering edit mode

Well I was trying to go with the simplest solution. I thought bambus just wants to get some idea about the coverage on different chromosomes. Of course, there are many factors that need to be considered. The best way would be to convert the SAM to BAM and use flags to flag the reads that are non-uniquely mapped and duplicates and then process the unflagged reads.

ADD REPLY
0
Entering edit mode
10.3 years ago
Varun Gupta ★ 1.3k
#!usr/bin/perl-w
use strict;
use warnings;

my %hash = ();
while(<>){
        chomp;
    next if($_ =~ /^@/); ## remove the headers in sam file
        my @s =split;
        my $contig = $s[2];
        $hash{$contig}++;
}

foreach (sort keys %hash){
        print join ("\t", $_,$hash{$_}),"\n";
}

Run the above code on the sam file as

perl read_count.pl test.sam > counts.txt

Hope this helps

Varun

ADD COMMENT
0
Entering edit mode

doesn't account for duplicates too

ADD REPLY
0
Entering edit mode
10.3 years ago
Pavel Senin ★ 1.9k

to count unique read names per contig, you can first extract unique pairs <read> <contig> and then count only contig names:

samtools view file.bam | cut -f1,3 | sort | uniq | cut -f2 | sort | uniq -c

This will yield a list of pairs <number of reads> <contig>

ADD COMMENT
0
Entering edit mode
10.3 years ago
Andreas ★ 2.5k

I might understand this wrongly but if you have a BAM file (which was indexed with samtools index) you can just run

samtools idxstats your.bam

which will list the "name, sequence length, # mapped reads and # unmapped reads" for each sequence/chromosome (see the idxstats section at http://samtools.sourceforge.net/samtools.shtml)

Andreas

Edit: Missed the fact that you were looking for unique counts

ADD COMMENT
0
Entering edit mode

i think (pretty sure) that it will double count if a read is aligned to the same contig twice

ADD REPLY
0
Entering edit mode

as far as I checked a read was mapped many times to different contigs but not that a single read was mapped many times to a single contig,but I will check it again

ADD REPLY
0
Entering edit mode

I have an example for one of contigs from my data named 1798943 where many reads are aligned twice:

  $samtools view 1798943.bam | cut -f1 | sort | uniq -c | sort -nr -k1 | head
  2 D3VDZHS1:96:D1WFDACXX:6:2316:18430:37842
  2 D3VDZHS1:96:D1WFDACXX:6:2316:15705:86382
  2 D3VDZHS1:96:D1WFDACXX:6:2316:15376:74645
  ...

idxstats sums up to 944 alignments (i call for grep to remove alignments which are all zeroes due to the sam header)

  $samtools idxstats 1798943.bam | grep -v '0\s0'
  1798943    106    944    0

while there are only 601 reads

  $samtools view 1798943.bam | cut -f1 | sort | uniq | wc -l
  601
ADD REPLY
0
Entering edit mode

ok,but unfortunately I couldn't find out if something exists like this in my file as it contains nearly 11,000,000 reads.Is it possible to find?

ADD REPLY
0
Entering edit mode

by extracting, sorting, and counting pairs <read id> <contig id> you'll be able to find contigs which have the same read aligned to them twice - then, you look into those.

ADD REPLY
0
Entering edit mode

I could overcome the problem of a single read aligning to more than 1 contig by executing BOWTIE2 again with default "reporting" option instead of "-a".It just prints out the alignment with best mapping score.So,I have no repetition of reads. 1read ----only---> 1 contig

ADD REPLY
0
Entering edit mode

yes, that is right. but let me explain my reasoning: you see, this particular behavior - forcing an aligner not to report multiple positions - is not a correct way to map reads back to assembly results, because an aligner will randomly place the read on the set of valid positions, thus, the mapping result will not be reproducible

ADD REPLY
0
Entering edit mode
10.3 years ago
bambus0725 ▴ 50

Hi Varun,

Thanks a lot.

Yes It is working,but somehow it is counting 1 time less than the actual number of times it is mapped.Just to be clear if a contig is mapped to "29" ID's it counts only "28" and is it possible to print the "ID's" that were mapped along as the 3rd column with the total number like now.

ADD COMMENT
0
Entering edit mode

Hi, Don't really think that it is counting the wrong way, but try this first samtools view -S -h -F 4 test.sam | perl read_count.pl > count.txt Does this give you the correct number. Also convert sam to bam and then do samtools idxstats test.bam the mapped reads should be equal to sum of col2 of count.txt file

Also do you want print the name of all 29 ID's that was mapped to that one contig and then for other contigs????. I can output it in 3rd column but comma separated. Let me know

ADD REPLY
0
Entering edit mode

Hi Varun,

Thank you once again,it's really a big favour to me.

I did the steps you mentioned here (but if I am right I could use samtools "view" only with .bam file)and you are right it is giving me the same total for both methods,but I am not clear why does it count 1 time more when I search a "contig ID" in sam file using 'grep'.

Yes,could you please modify the script so that it will also print the read ID and its respective sequence in 3rd and 4th coulmns also tab separated is ok.And only if possible can it also ouput a second file( a list of unmapped reads or contigs or both) if incase there are unmapped reads and contigs as this happend in my case.

hoping to hear from you soon.

Like for example an input sample.fa contains 2,210,379 reads is used for assembly(TRINITY)resulted in 56,496 contigs.Once after aligning using(BOWTIE2)only 1,187,052 reads were mapped back.And now after counting the total number of reads mapped to each contig and summing up all gives different number that is 1186786 reads were mapped back to only 53,850 contigs..I am trying to find out why it is so?

ADD REPLY
0
Entering edit mode

You can use samtools view with sam file. I edited my reply. my bad i missed -S option. Ok lets solve the prob one at a time. can you tell me your grep command you are using??

ADD REPLY
0
Entering edit mode

oh ok and here is the command I used grep -c "term to count" <filename< p="">

ADD REPLY
0
Entering edit mode

what is the input file???

ADD REPLY
0
Entering edit mode

contigs.fa after assembly is the input file or reference file and the original reads file to map

ADD REPLY
0
Entering edit mode

ok did you use this command grep -c "contig_name" test.sam If yes you will get 1 count extra because grep is also counting your header line which has your contig name Did you get it??

ADD REPLY
0
Entering edit mode
10.3 years ago
bambus0725 ▴ 50

Hi Pavel Senin ,

Thank you for the reply.

Actually what you meant in the first post was right. Regardless of how many times a read is mapped through out the file, at the end it should print "contig name" and total number of reads mapped to that particular contig along with the list of read ID's.

ADD COMMENT
0
Entering edit mode

it would be easier to read and to discuss, if you transfer these replies into the comments section. so, you want the list like <contig id> <number of mapped reads to contig> <read id> ? or two separate lists one is <contig id> <number of mapped reads to contig> and the second is <contig id> <read id>? latter is easy to get, while former requires some additional effort (i guess)

ADD REPLY
0
Entering edit mode

Ok.

Yes,I need to make a single list or a table with contig id> number of mapped reads> read id> read>

ADD REPLY
0
Entering edit mode

I think you'll need to write some code for that - you'll need to parse the data twice - using first pass results. out of curiosity, why would you need that - to make a table with 11M of reads?

ADD REPLY
0
Entering edit mode

Yes. I want to find the coverage,but the total number of reads mapped should be around 4,000,000 since a single is mapped many times to different contigs the number went high or may be due to the reason u explained

ADD REPLY
0
Entering edit mode
10.3 years ago
bambus0725 ▴ 50

Hi ashutoshmits,

Thank you for the answer.

I didn't try with the first option yet,but with "awk" it doesn't gave the total count correctly.

ADD COMMENT

Login before adding your answer.

Traffic: 2223 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6