Question

Algorithm for Getting per position nucleobases in python from bam file

0

Entering edit mode

7.5 years ago

ammarsabir15 ▴ 70

I want to extract nucleotide per position out of a bam file in python using pysam module of python.
As i am a beginner so I want to ask that
Which part of bam file tells about position??
How to calculate nuleotides(A,T,G,C) present on this position(i.e what logic should be used) using python's pysam??
So anyone tell me about this.
The sample bam file and the code used to print it is shown below.

.

#!/usr/bin/python
import pysam
samfile = pysam.Samfile( "reads.sorted.bam", "rb" )
for line in samfile:
    print(line)

The above code prints the bam file as follows.

72NUT:00012:01818   0   0   0   0   4M22I250M   -1  -1  276 CCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTTTGGGGGGTATGCACGCGATAGCATTGCGGGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCCATTATTTATCGCACCTACGTTCAATATTACAGGCGAGCATACTTACTAAAGCGTATTAATTAATTAATGCTTGTAGGACATAACAATAACAATTAAATGTCT    array('B', [35, 32, 35, 32, 34, 34, 30, 34, 34, 32, 34, 34, 32, 34, 34, 34, 34, 33, 33, 33, 34, 34, 32, 33, 33, 34, 34, 34, 34, 34, 32, 34, 35, 35, 34, 33, 33, 34, 34, 35, 30, 34, 34, 34, 32, 34, 31, 34, 31, 34, 25, 25, 25, 34, 23, 23, 23, 11, 23, 22, 17, 25, 34, 35, 35, 32, 34, 34, 34, 34, 34, 34, 35, 30, 34, 32, 30, 30, 30, 34, 34, 24, 34, 34, 34, 34, 29, 34, 34, 34, 34, 34, 16, 34, 34, 30, 30, 26, 32, 32, 32, 34, 32, 32, 32, 30, 30, 29, 25, 26, 21, 26, 26, 25, 36, 27, 33, 31, 31, 26, 26, 26, 26, 30, 31, 31, 30, 36, 29, 30, 30, 30, 30, 31, 34, 26, 30, 30, 30, 33, 33, 33, 30, 30, 30, 31, 31, 31, 33, 30, 30, 30, 33, 33, 33, 33, 26, 33, 33, 33, 28, 32, 28, 32, 31, 31, 28, 31, 31, 31, 30, 30, 30, 24, 30, 34, 30, 35, 36, 35, 19, 26, 26, 26, 33, 33, 29, 30, 27, 30, 30, 30, 30, 25, 20, 15, 15, 11, 15, 15, 15, 18, 23, 23, 33, 33, 29, 26, 26, 26, 30, 30, 30, 29, 30, 30, 30, 28, 32, 32, 30, 30, 27, 11, 25, 23, 23, 25, 25, 33, 30, 33, 29, 33, 28, 31, 20, 25, 19, 25, 22, 27, 27, 31, 31, 17, 23, 23, 29, 27, 22, 27, 32, 25, 26, 25, 28, 32, 33, 21, 26, 26, 30, 26, 26, 23, 26, 30, 33, 33, 24, 30, 30, 30, 26, 26])    [('AS', -138), ('XN', 0), ('XM', 14), ('XO', 1), ('XG', 22), ('NM', 36), ('MD', '0G0A1C59C8G19A56T1T36A5C8T2G28T10G7'), ('YT', 'UU')]
72NUT:00158:00470   16  0   0   3   4M7I213M    -1  -1  224 CACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGGGCTCTCCATGCATTTGGTATTTTCGTTTGGGGGGTATGCACGCGATAGCATTGCGGGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCCATTATTTATCGCACCTACGTTCAATATTACAGGCGAGCATACTTACTAAAGCGTATTAATTAATT    array('B', [21, 21, 22, 23, 14, 14, 9, 15, 15, 15, 23, 34, 23, 23, 17, 25, 25, 25, 23, 27, 27, 27, 25, 18, 26, 26, 26, 32, 28, 33, 30, 30, 21, 25, 25, 16, 16, 16, 22, 21, 9, 23, 23, 23, 23, 22, 22, 21, 30, 30, 29, 25, 25, 25, 23, 27, 17, 27, 23, 17, 23, 23, 23, 9, 23, 23, 23, 23, 27, 21, 38, 32, 11, 32, 32, 32, 32, 23, 28, 28, 28, 28, 29, 34, 34, 34, 34, 34, 34, 34, 30, 25, 23, 25, 17, 25, 25, 34, 27, 36, 35, 34, 34, 34, 34, 34, 32, 34, 34, 34, 31, 34, 31, 34, 34, 34, 34, 34, 29, 35, 34, 34, 34, 34, 34, 34, 30, 30, 36, 30, 30, 26, 30, 33, 33, 35, 25, 28, 25, 14, 25, 25, 23, 23, 17, 33, 30, 23, 25, 25, 20, 25, 23, 28, 35, 27, 21, 30, 30, 29, 22, 27, 23, 11, 23, 25, 25, 29, 29, 34, 36, 34, 32, 34, 30, 30, 30, 34, 31, 34, 34, 31, 35, 34, 34, 32, 34, 34, 34, 34, 32, 34, 34, 35, 35, 34, 34, 34, 34, 34, 34, 30, 31, 31, 26, 26, 15, 26, 26, 26, 34, 31, 34, 34, 31, 34, 32, 34, 31, 34, 31, 34, 32, 34])   [('AS', -85), ('XN', 0), ('XM', 13), ('XO', 1), ('XG', 7), ('NM', 20), ('MD', '0G1T0C32A26C8G19A56T1T36A5C8T2G10'), ('YT', 'UU')]
72NUT:00473:00766   0   0   0   0   4M12I165M1I19M1I15M1I2M1I22M1D35M1I28M  -1  -1  307 GACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTTTGGGGGGTACGCACGCGATAGCATTGCGGGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCCATTATTTATCGCACCTATCGTTCAATATTACAGGCGAGTCATACTTACTAAAGCTGATATTAATTAATTAATGCTTGTAGACATAACAATAACAATTAAATGTCTGCACAGCCGTCTTTCCACACAGACATCATAACAAAAAA array('B', [35, 34, 34, 34, 34, 35, 35, 35, 30, 31, 31, 34, 32, 34, 34, 34, 34, 34, 34, 34, 32, 34, 34, 34, 34, 34, 34, 30, 30, 30, 26, 30, 34, 34, 31, 34, 32, 34, 32, 35, 34, 32, 32, 34, 26, 25, 25, 14, 26, 25, 31, 31, 31, 31, 33, 26, 29, 29, 34, 34, 36, 34, 35, 30, 34, 32, 35, 34, 34, 34, 34, 24, 34, 34, 32, 33, 11, 23, 23, 23, 23, 23, 8, 23, 31, 34, 34, 34, 34, 34, 34, 34, 34, 29, 30, 30, 34, 34, 34, 34, 32, 30, 26, 26, 36, 24, 34, 34, 34, 34, 34, 35, 26, 29, 30, 31, 29, 29, 33, 35, 30, 26, 28, 34, 34, 29, 35, 35, 34, 32, 32, 36, 34, 35, 31, 31, 31, 34, 32, 33, 32, 34, 26, 29, 29, 34, 25, 30, 30, 34, 32, 34, 32, 34, 34, 35, 32, 34, 34, 34, 34, 33, 34, 26, 30, 30, 26, 29, 35, 36, 28, 26, 29, 34, 27, 14, 15, 14, 23, 26, 14, 14, 13, 13, 26, 8, 13, 13, 8, 19, 19, 34, 28, 34, 34, 34, 34, 28, 17, 17, 17, 17, 17, 19, 19, 19, 19, 17, 33, 20, 25, 30, 13, 29, 29, 21, 27, 14, 14, 14, 14, 13, 18, 19, 18, 18, 18, 32, 13, 17, 13, 16, 8, 16, 8, 16, 25, 30, 19, 6, 12, 12, 12, 12, 12, 20, 12, 12, 12, 18, 12, 17, 13, 17, 24, 18, 17, 24, 13, 17, 22, 33, 33, 14, 25, 25, 31, 25, 25, 17, 17, 31, 25, 13, 13, 13, 8, 12, 12, 12, 12, 18, 8, 12, 8, 16, 18, 18, 16, 16, 21, 32, 32, 30, 30, 30, 21, 19, 19, 25, 16, 12, 12, 12, 12, 12, 4])  [('AS', -147), ('XN', 0), ('XM', 13), ('XO', 7), ('XG', 18), ('NM', 31), ('MD', '2T0C59C8G0T18A56T1T36A5C11G20^G7T10G44'), ('YT', 'UU')]
72NUT:00541:01644   0   0   0   3   4M5I265M    -1  -1  274 CGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTTTGGGGGGTATGCACGCGATAGCATTGCGGGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCCATTATTTATCGCACCTACGTTCAATATTACAGGCGAGCATACTTACTAAAGCGTATTAATTAATTAATGCTTGTATGACATAACAATAACAATTAAATGTCTGCACAGCCGTCTTTC  array('B', [36, 34, 34, 30, 30, 30, 31, 34, 34, 35, 34, 34, 34, 32, 30, 30, 30, 34, 34, 33, 29, 30, 29, 26, 29, 33, 34, 32, 34, 33, 35, 32, 34, 34, 34, 34, 34, 29, 26, 29, 26, 29, 27, 27, 31, 34, 35, 35, 32, 34, 31, 31, 30, 34, 34, 36, 30, 35, 31, 34, 34, 34, 35, 34, 25, 34, 34, 34, 36, 25, 26, 29, 26, 26, 26, 10, 26, 34, 34, 34, 34, 34, 30, 30, 30, 34, 34, 34, 30, 30, 30, 34, 34, 32, 34, 34, 34, 35, 29, 35, 36, 36, 34, 31, 30, 29, 30, 34, 34, 31, 34, 32, 34, 35, 35, 31, 31, 31, 28, 31, 34, 34, 37, 31, 31, 32, 39, 35, 31, 31, 31, 34, 34, 34, 30, 31, 31, 31, 37, 29, 34, 34, 34, 29, 30, 28, 30, 34, 34, 30, 34, 34, 34, 29, 30, 30, 27, 30, 35, 32, 34, 34, 34, 28, 30, 30, 34, 34, 34, 34, 35, 32, 34, 34, 29, 29, 29, 22, 26, 34, 29, 26, 26, 26, 28, 26, 26, 26, 25, 19, 25, 25, 25, 24, 19, 19, 19, 19, 17, 20, 30, 34, 34, 34, 34, 34, 29, 34, 29, 20, 17, 24, 24, 20, 20, 25, 23, 32, 35, 31, 30, 26, 30, 26, 30, 30, 30, 27, 11, 14, 15, 15, 14, 14, 14, 15, 27, 27, 29, 26, 31, 34, 32, 34, 35, 32, 34, 34, 31, 34, 31, 34, 34, 24, 25, 25, 25, 25, 25, 25, 25, 25, 31, 14, 14, 14, 21, 14, 14, 14, 26, 26, 8, 13]) [('AS', -101), ('XN', 0), ('XM', 19), ('XO', 1), ('XG', 5), ('NM', 24), ('MD', '0G0A0T0C59C8G19A56T1T36A5C8T2G20G7T10G16C0T2C1'), ('YT', 'UU')]
72NUT:00614:00539   16  0   0   0   3M17I77M1I160M  -1  -1  258 TAACGAACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTTTGGGGGGTATGCACGCGGATAGCATTGCGGGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCCATTATTTATCGCACCTACGTTCAATATTACAGGCGAGCATACTTACTAAAGCGTATTAATTAATTAATGCTTGTAGGACATAGCAATA  array('B', [12, 8, 12, 12, 12, 8, 12, 12, 12, 21, 25, 24, 31, 34, 34, 29, 29, 30, 30, 34, 34, 34, 30, 30, 29, 35, 35, 35, 30, 31, 31, 26, 32, 26, 34, 32, 32, 32, 30, 33, 29, 33, 29, 27, 29, 30, 33, 33, 32, 30, 24, 30, 30, 25, 29, 29, 34, 29, 28, 20, 34, 20, 23, 29, 29, 34, 23, 36, 30, 21, 25, 25, 32, 22, 33, 32, 33, 34, 34, 29, 34, 34, 16, 36, 36, 38, 37, 38, 34, 30, 26, 26, 22, 22, 23, 23, 14, 9, 14, 14, 25, 31, 31, 33, 34, 32, 34, 33, 32, 26, 32, 31, 31, 31, 31, 31, 36, 30, 31, 34, 34, 32, 30, 29, 29, 29, 25, 25, 25, 24, 29, 25, 29, 33, 31, 28, 29, 30, 34, 34, 34, 34, 34, 34, 30, 30, 30, 35, 34, 34, 28, 34, 32, 25, 25, 17, 25, 17, 31, 31, 28, 26, 29, 30, 34, 34, 34, 30, 34, 34, 34, 31, 35, 35, 30, 35, 35, 34, 34, 34, 34, 34, 34, 32, 34, 34, 34, 34, 34, 32, 34, 34, 31, 34, 34, 34, 31, 34, 34, 34, 34, 32, 35, 34, 34, 34, 34, 34, 34, 34, 34, 34, 31, 35, 35, 34, 34, 28, 35, 34, 34, 34, 34, 34, 34, 31, 34, 31, 34, 33, 35, 32, 34, 32, 36, 33, 34, 30, 30, 30, 31, 34, 34, 34, 34, 32, 34, 34, 34, 35, 35, 34, 35, 34, 31, 34, 34, 32])   [('AS', -125), ('XN', 0), ('XM', 13), ('XO', 2), ('XG', 18), ('NM', 31), ('MD', '0G1T60C8G19A56T1T36A5C8T2G27A0T4'), ('YT', 'UU')]

alignment next-gen sequencing • 4.1k views

ADD COMMENT • link updated 7.5 years ago by Devon Ryan 104k • written 7.5 years ago by ammarsabir15 ▴ 70

score 1 · Answer 1 · 2016-10-05

1

Entering edit mode

7.5 years ago

Devon Ryan 104k

You're looking for the pileup() function, which is equivalent to samtools mpileup ....

ADD COMMENT • link 7.5 years ago by Devon Ryan 104k

0

Entering edit mode

what if I start from 1st base of the alignment file and save all the bases in an array or a datastructure like that because I want to align these bases to reference genome.?? Is it right methodology??

ADD REPLY • link 7.5 years ago by ammarsabir15 ▴ 70

0

Entering edit mode

You're describing a less memory efficient version of the what I suggested. Why waste time coding that when it already exists?

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks,

It works fine for printing bases per position.

But I want to save these bases per position in a datastructure.

I tried to use list or string but all the bases are saved singly like 'A','T','G','C','C','C' at postion 1,2,3,4,5,6 etc but i want it to be saved like 'A','T','G','CCC' at position 1,2,3,4

In short I want the datastructure to be indexed according to the base position and all the bases at that position should be stored at a single position.

I have used this code but it is not giving correct results.

import pysam

prot=' '

samfile = pysam.Samfile( "reads.sorted.bam", "rb" )

for pileupcolumn in samfile.pileup():

     for pileupread in pileupcolumn.pileups:

         if not pileupread.is_del and not pileupread.is_refskip:

                 prot = prot + 'pileupread.alignment.query_sequence[pileupread.query_position]'

And the output comes as 'A','T','G','C','C','C','C' not as intended 'A','T','G','CCCC'

ADD REPLY • link 7.5 years ago by ammarsabir15 ▴ 70

0

Entering edit mode

pileupcolumn contains all of the alignments that overlap a particular position, annotated by the offset inside the alignment to the appropriate base. If you want to store the results in a particular data structure then just do it. prot would need to be a list that is appended to with a string created anew every time the for pileupread in pileupcolumn.pileups: is encountered.

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

0

Entering edit mode

I have tried this code

samfile = pysam.Samfile( "reads.sorted.bam", "rb" )
mystring = ''
mylist = []
for pileupcolumn in samfile.pileup():        
     for pileupread in pileupcolumn.pileups:
           if not pileupread.is_del and not pileupread.is_refskip:                
                     mystring = mystring + pileupread.alignment.query_sequence[pileupread.query_position]

for pileupcolumn in samfile.pileup():
        mylist =mylist + [mystring[pileupcolumn.pos:pileupcolumn.pos+pileupcolumn.n]]

But it is not working right and has following issues:

Length of the list i.e len(mylist) is 16568 whereas total base positions in the bam file were 16567.
It works rightly for only few indices i.e for first 8 positions where it gives 'G', 'A', 'T', 'C', 'A', 'C', 'A', 'GGG', 'GGG'
But for next postions it gives wrong output as 'GGG', 'GGGTT', 'GGTTT', 'GTTTCC', 'TTTCCCC', 'TTCCCCC', 'TCCCCCT', 'CCCCCTT', 'CCCCTTT'
Whereas the coverage in bam file is

cvrge at base 0 = 1

Rev G

cvrge at base 1 = 1

Rev A

cvrge at base 2 = 1

Rev T

cvrge at base 3 = 1

Rev C

cvrge at base 4 = 1

Rev A

cvrge at base 5 = 1

Rev C

cvrge at base 6 = 1

Rev A

cvrge at base 7 = 3

Rev G

Fwd G

cvrge at base 8 = 3

Rev G

Fwd G

cvrge at base 9 = 3

Rev T

Fwd T

cvrge at base 10 = 5

Rev C

Fwd C

Rev C

cvrge at base 11 = 5

Rev T

Fwd T

Rev T

cvrge at base 12 = 6

Rev A

Fwd A

Rev A

Fwd A

cvrge at base 13 = 7

Rev T

Fwd T

Rev T

Fwd T

Rev T

cvrge at base 14 = 7

Rev C

Fwd C

Rev C

Fwd C

Rev C

cvrge at base 15 = 7

Rev A

Fwd A

Rev A

Fwd A

Rev A

cvrge at base 16 = 7

Rev C

Fwd C

Rev C

Fwd C

Rev C

cvrge at base 17 = 7

Rev C

Fwd C

Rev C

Fwd C

Rev C

ADD REPLY • link 7.5 years ago by ammarsabir15 ▴ 70

0

Entering edit mode

Please post an example BAM file somewhere, noting an example region where you program isn't showing the results you expect.

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

0

Entering edit mode

I am not working on a specific region I am working on the whole mt genome , and want to save the coverage of whole bam file in the list as I tried in above code. Here is the link to the file Bam file

ADD REPLY • link 7.5 years ago by ammarsabir15 ▴ 70

0

Entering edit mode

This looks like an XY problem, what exactly do you want to achieve and what is the aim of this analysis?

ADD REPLY • link 7.5 years ago by WouterDeCoster 47k

0

Entering edit mode

In order to get two separate list for positive and negative strands ,I have used this code that will extract two lists having bases at positions according to the coverage file as follows

`

         NegativeList = ['G','A','T','C','A',........]                
         PositiveList  = ['-','-','-','-','-','-','G','G','G'........]

`

This code puts '-' where a base is not present ,

`

       ReverseList = ['-'] * 16570
       ForwardList = ['-'] * 16570

   with open('gattaca.txt','r') as file:
       for line in file:
              if 'CVRGE' in line.upper():
         count = int(line.split('=')[-1].strip())
         base = int(line.split('=')[0].strip().split(' ')[-1])
   if 'REV' in line.upper():
         if '-' in ReverseList[base]:
               ReverseList[base] = line.rstrip()[-1]
    else:
               ReverseList[base] += line.rstrip()[-1]
   if 'FWD' in line.upper():
         if '-' in ForwardList[base]:
              ForwardList[base] = line.rstrip()[-1]
    else:
              ForwardList[base] += line.rstrip()[-1]

`

Its working rightly but I want to ask that if there is any more efficient way to achieve that ??

ADD REPLY • link 7.5 years ago by ammarsabir15 ▴ 70

0

Entering edit mode

Why not just directly do that in the original script?

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

0

Entering edit mode

I tried to do it using pysam but it resulted in very absurd results ,
Can you help write code to achieve the above mentioned results i.e to save coverage data from negative and positive strands in negative and positive lists that are indexed according to the number of bases.?
Also tell if this can result in improving the efficiency ??

ADD REPLY • link 7.5 years ago by ammarsabir15 ▴ 70

0

Entering edit mode

The only way to make it more efficient would be to write it in C, which would take more time than it's worth for you. You can figure out how to do this yourself, just make a small test file so you can run (and therefore debug) things quickly and easily.

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

0

Entering edit mode

I finally figured it out using this code which works fine for me i.e instead of creating a new file from pileup() extract the strand based coverage from within the loop for pileup() function, the code is given below,

  samfile = pysam.Samfile( filename, "rb" )
  ReverseList = [''] * lenref  #where lenref is length of reference passed as an argument
  ForwardList = [''] * lenref
  for pileupcolumn in samfile.pileup() :
        if (pileupcolumn.n <= 3):
               continue
        for pileupread in pileupcolumn.pileups:
               if not pileupread.is_del and not pileupread.is_refskip:
                      if pileupread.alignment.is_reverse: #negative
                                ReverseList[pileupcolumn.pos] += pileupread.alignment.query_sequence[pileupread.query_position]
                      else:
                                ForwardList[pileupcolumn.pos] += pileupread.alignment.query_sequence[pileupread.query_position]

But know I am stuck in a new problem i.e how to filter bam file based on mapping qualities which I have asked here here.

ADD REPLY • link 7.4 years ago by ammarsabir15 ▴ 70

score 0 · Answer 2 · 2016-10-17

0

Entering edit mode

7.5 years ago

Devon Ryan 104k

The following produces the correct nucleotide distribution per-base. If you're getting something else, then it's because you're coding it wrong.

#!/usr/bin/env python
import pysam

samfile = pysam.Samfile("reads.sorted.bam", "rb")
mystring = ''
mylist = []
for pileupcolumn in samfile.pileup("gi|251831106|ref|NC_012920.1|"):        
    mystring = ''
    for pileupread in pileupcolumn.pileups:
        if not pileupread.is_del and not pileupread.is_refskip:                
            mystring = mystring + pileupread.alignment.query_sequence[pileupread.query_position]
    mystring = "".join(sorted(list(mystring)))  # Sort alphabetically
    mylist.append((pileupcolumn.pos+1,mystring))
samfile.close()
for (k, v) in mylist:
    print("{}: {}".format(k, v))

ADD COMMENT • link 7.5 years ago by Devon Ryan 104k

0

Entering edit mode

I have a basic question about pileup(), it will also tell about N's at any base position, am I write??

ADD REPLY • link 7.5 years ago by ammarsabir15 ▴ 70

0

Entering edit mode

Yup, if a read had an N then you'll see it (presuming the phred score filtering is appropriate).

ADD REPLY • link 7.5 years ago by Devon Ryan 104k

0

Entering edit mode

You have mentioned ID of the reference in samfile.pileup("gi|251831106|ref|NC_012920.1|") does it tells the loop to iterate to the last read of the reference which is in this case 16569 ??