Question

Pileup format parsing difficulty

7.5 years ago

maheetha.b ▴ 70

Hello all, I'm very new to samtools. In my pileup, I'm getting lines that look something like this:

chr1 2XXXX N 26 <<<<<<<<<<>>>>>>CCCccccCCC AAAAAAAAAADDDDDDGIJIJJJJIJ

I'm not sure what the <<<<>>>> means, although my guess is that it has to do with the forward and reverse strands. Similarly, I know that the uppercase and lowercase C mean that some reads didn't match the forward or reverse strand, but I'm not sure exactly what that means. Does it mean that the forward strand was something other than a C and my read had a C, or the other way around?

Some assistance would be helpful.

pileup samtools • 1.8k views

updated 7.5 years ago by John Marshall 3.0k • written 7.5 years ago by maheetha.b ▴ 70

score 1 · Answer 1 · 2016-10-07

Have you read the documentation under mpileup in the samtools manual page?

In the pileup format (without -u or -g), each line represents a genomic position, consisting of chromosome name, 1-based coordinate, reference base, the number of reads covering the site, read bases, base qualities and alignment mapping qualities. Information on match, mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the read base column [etc]
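To make the column layout concrete, here is a minimal Python sketch that splits the line from the question into the first six columns described in the quote. The names `PileupLine` and `parse_pileup_line` are made up for illustration and are not part of samtools. The coordinate is kept as a string only because it is masked in the question.

```python
from typing import NamedTuple


class PileupLine(NamedTuple):
    """First six columns of a pileup line (hypothetical helper type)."""
    chrom: str
    pos: str    # 1-based coordinate; kept as text because the example masks it
    ref: str    # reference base; N when mpileup was run without -f REF
    depth: int  # number of reads covering the site
    bases: str  # read bases column
    quals: str  # base qualities column


def parse_pileup_line(line: str) -> PileupLine:
    # Columns are whitespace-separated; any optional trailing columns are ignored.
    chrom, pos, ref, depth, bases, quals = line.split()[:6]
    return PileupLine(chrom, pos, ref, int(depth), bases, quals)


example = "chr1 2XXXX N 26 <<<<<<<<<<>>>>>>CCCccccCCC AAAAAAAAAADDDDDDGIJIJJJJIJ"
rec = parse_pileup_line(example)
print(rec.ref, rec.depth, len(rec.bases), len(rec.quals))
```

For this particular line the depth, the number of read-base characters and the number of quality characters all come out as 26. That one-to-one match only holds when the read bases column has no read start/end markers or indel annotations.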

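Each < or > character indicates a read that skips this reference position (a CIGAR N operation, typically a spliced read spanning an intron). This column is named read bases, so the C and c characters show the bases in your reads: uppercase for a read aligned to the forward strand, lowercase for a read aligned to the reverse strand. The reference base is shown in column 3, or would be if you had used the -f REF mpileup option; without it, column 3 just shows N.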
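As a rough illustration of how the read bases column can be decoded, here is a small Python sketch that tallies the symbols described in the samtools manual. The function name `summarize_read_bases` and the category labels are my own, and it assumes a well-formed column; treat it as a starting point rather than a complete parser.

```python
from collections import Counter


def summarize_read_bases(bases: str) -> Counter:
    """Tally the per-read symbols in a pileup read bases column."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # Read start: '^' is followed by one mapping-quality character.
            i += 2
            continue
        if c == "$":
            # Read end marker, carries no base of its own.
            i += 1
            continue
        if c in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            counts["indel"] += 1
            i = j + int(bases[i + 1:j])
            continue
        if c in "<>":
            counts["ref_skip"] += 1      # CIGAR N at this position
        elif c == ".":
            counts["fwd_match"] += 1     # matches reference, forward strand
        elif c == ",":
            counts["rev_match"] += 1     # matches reference, reverse strand
        elif c in "*#":
            counts["deleted"] += 1       # base deleted in this read
        elif c.isupper():
            counts["fwd_base"] += 1      # explicit base, forward strand
        elif c.islower():
            counts["rev_base"] += 1      # explicit base, reverse strand
        i += 1
    return counts


summary = summarize_read_bases("<<<<<<<<<<>>>>>>CCCccccCCC")
print(dict(summary))
```

On the line from the question this counts 16 reads that skip the position, 6 forward-strand reads showing C and 4 reverse-strand reads showing c. Because no reference was supplied, every base is printed explicitly instead of as . or , for a match.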