Pileup format parsing difficulty
7.5 years ago
maheetha.b ▴ 70

Hello all, I'm very new to samtools. In my pileup, I'm getting lines like this:

chr1 2XXXX N 26 <<<<<<<<<<>>>>>>CCCccccCCC AAAAAAAAAADDDDDDGIJIJJJJIJ

I'm not sure what the <<<<>>>> means, although my guess is that it has to do with the forward and reverse strands. Similarly, I know that the capital and lowercase C mean that some reads didn't match the forward or reverse strand, but I'm not sure exactly what that means. Does it mean that the forward strand was something other than a C and my read had a C, or the other way around?

Some assistance would be helpful.

7.5 years ago

Have you read the documentation under mpileup in the samtools manual page?

In the pileup format (without -u or -g), each line represents a genomic position, consisting of chromosome name, 1-based coordinate, reference base, the number of reads covering the site, read bases, base qualities and alignment mapping qualities. Information on match, mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the read base column [etc]

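The < and > characters each indicate a read that has skipped this reference position (due to a CIGAR N operation); > is used for reads on the forward strand and < for reads on the reverse strand. This column is named read bases (the reference base is shown in column 3, or would be if you had used the -f REF mpileup option, which is why yours shows N), so the C and c characters show the bases in your reads: uppercase for a read on the forward strand, lowercase for a read on the reverse strand.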
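To make the encoding concrete, here is a minimal Python sketch that tallies the symbols in the read-bases column. It assumes the conventions described in the samtools mpileup manual page (uppercase letters and > for forward-strand reads, lowercase letters and < for reverse-strand reads, ^ plus one mapping-quality character at a read start, $ at a read end, +N/-N followed by N bases for indels). The function name and the category labels are made up for illustration, and newer symbols such as # are simply ignored.

```python
import re
from collections import Counter


def summarize_read_bases(read_bases: str) -> Counter:
    """Tally the symbols in the read-bases column of one pileup line.

    Hypothetical helper for illustration; not part of samtools.
    """
    counts = Counter()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            # Read start: '^' is followed by one mapping-quality character.
            i += 2
            continue
        if ch == "$":
            # Read end marker carries no base of its own.
            i += 1
            continue
        if ch in "+-":
            # Indel: sign, decimal length, then that many bases.
            m = re.match(r"\d+", read_bases[i + 1:])
            if m is None:
                raise ValueError(f"malformed indel at offset {i}")
            counts["insertion" if ch == "+" else "deletion"] += 1
            i += 1 + len(m.group()) + int(m.group())
            continue
        if ch == ">":
            counts["skip_fwd"] += 1      # reference skip, forward-strand read
        elif ch == "<":
            counts["skip_rev"] += 1      # reference skip, reverse-strand read
        elif ch == ".":
            counts["match_fwd"] += 1     # matches reference, forward strand
        elif ch == ",":
            counts["match_rev"] += 1     # matches reference, reverse strand
        elif ch == "*":
            counts["deleted_base"] += 1  # placeholder for a deleted base
        elif ch.isalpha():
            strand = "fwd" if ch.isupper() else "rev"
            counts[f"{strand}_{ch.upper()}"] += 1
        i += 1
    return counts


# Read-bases column from the line in the question above.
example = "<<<<<<<<<<>>>>>>CCCccccCCC"
print(summarize_read_bases(example))
```

For the line in the question, the skip and base tallies together should add up to the depth in column 4, since every covering read contributes exactly one symbol when there are no indels.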
