Extracting The Base/Nucleotide Across All Reads At Particular Position/S.
1
5
Entering edit mode
11.9 years ago
Empyrean ▴ 170

Hello everyone. I will try to explain with an example of what i need.

We have vcf file to get some positions of snps.

#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ

and a bam file with the alignments to reference. the approximate depth is 20x and all are 454 reads. Now from this information, i wanted to extract all the bases at the particular positions.. I need output of something like this.

Read Name   Position1   Position2   Position3   Position4
            14370       17330       1110696     1230237
Read1         G           T             A           T
Read2         A           A             G           .
Read3         A           A             G           .
Read4         G           A             A           T
Read5         A           T             T           .
Read6         A           T             T           .
Read7         G           A             A           T

I would like to build such graph across all reads with specific position. Can this be possible by samtools mpileup/vcf tools or does anyone has any script written to solve such problem. From this information i will be extracting the haplotype information for specific genotypes.

samtools vcftools • 5.8k views
ADD COMMENT
0
Entering edit mode

I did something similar a long time back. wouldn't the sff tools from 454 have something to help?

ADD REPLY
0
Entering edit mode

I looked in to it but its of not much help !! can we do this with samtools or vcftools?

ADD REPLY
3
Entering edit mode
11.9 years ago
Rm 8.3k

I had an old script extract.allele.4m.strand.pileup.pl) based on samtools.pl and uses samtools "pileup" output: it prints the read sequence at a position and the allele counts too:

##################
#!/usr/bin/perl 
##########################################################################################
# Script to extract alleles from Pileup file.                                       #
# It is Based on samtools.pl:        #
##########################################################################################

use strict;
#use warnings;

&usage if (@ARGV < 1);

my $command = shift(@ARGV);
my %func = (pileup2allele=>\&pileup2allele);

die("Unknown command \"$command\".\n") if (!defined($func{$command}));
&{$func{$command}};
exit(0);

# pileup2allele

sub pileup2allele {
  die(qq/
Usage:   extract.allele.4m.strand.pileup.pl pileup2allele <input.chr.raw-pileup>
/) if (@ARGV == 0 && -t STDIN);

my @qual;
my @line;
my $phred=10; # Phred quality threshold
my $sanger=33; # Sanger quality=33 # solexa/illumina=64 

# Reading input pileup 
  while (<>) {
    @line = split;
      if ($line[2] eq "*"){next;} # skipping indel row #
      $line[8] =~ s/[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVNF]+//g; # Removes unwanted extra-indels notation from read base lines
      $line[8] =~ s/\^.//g; # Removes unwanted symbols
      $line[8] =~ s/[0-9\@\'\&\!\?\_\$\*\:\;\F\"\+\#\-\=\%\)\(\/\~\}\[]+//g; # Removes unwanted symbols 
      $line[8] =~ s/\./$line[2]/g; # Substituting . with reference base 
      $line[8] =~ s/\,/lc($line[2])/ge; # Substituting , with lower case reference base 

#      $line[8] = uc($line[8]); # Converting to UPPER case

      # Store quality scores as an array
      @qual = split(//, $line[9]);
      # Transform quality values from ASCII into Sanger format
      for( my $i = 0; $i < scalar @qual; $i++ ){
       $qual[$i] = ord($qual[$i]) - $sanger ;
           }


    my $char;
    my %chars=();
    my $chars;
    my @letters = split(//, $line[8]);
    my $top=0;
    print "$line[1]\t$line[8]\t"; # print nucleotide position and the Sequence from the READS

    # Quality comparison 
    foreach $char(@letters) { if ($qual[$top] >= $phred){ $chars{ $char }++; } $top++; }
    my @keys = sort { $chars{$b} <=> $chars{$a} } keys %chars; # sorting higher to lower value
    foreach my $key(@keys){ print $key, " ", $chars{$key},"\t";}
    print "\n";

  } # End of while

 } # End of pileup

##################

sub usage {
my $str ='Sample Input Pileup Data
10      56256   A       A       75      0       21      16      .,+4gcgc,,,,,,..,...,,  13?5??>5<9=71570';
  die(qq/
Usage:   extract.allele.4m.strand.pileup.pl pileup2allele <input.chr.raw-pileup>  \n
$str \n/);
}

# ----------------------------------------------------  

#2. If pipe the pileup through samtools output
##samtools pileup -cf reference.fasta  Input.bam | ./extract.allele.4m.strand.pileup.pl pileup2allele >OUTPUT.TXT

# ----------------------------------------------------  

## End of the Script ######

Sample Out put:

 Position Sequence_on_the_read(s)                                                 Allele_counts
    3308    TTttTctTTttTtTTtTTTtTtTTTtttTtTtTtttcctTttctttcctcTTttttTTttttTTtTt     t 33    T 26    c 7
    3309    cccccccccccc    c 12
    3311    CccCcCCgGCgcCCcCCCccccCCCCcGCcCCGGcCCgccccCgcCCCCcCcGc  C 25    c 20    G 5     g 4
    3312   CccCaCccccAcCAacCCaCcccACCCCCccCCcccCcCCCcccccCCCccCca  c 25    C 22    a 4     A 2
ADD COMMENT
0
Entering edit mode

Thank you for your answer.. will this work on mpileup output? i have aligned bam file and vcf file which is output from mpileup. will this print read name? i dont want to count the alleles but to display them . I can get this sequence information from bam file using Samtools -r [position] [filename] output but there also it doesnt print the read name..

ADD REPLY
0
Entering edit mode
  1. if you change "$line[number]" to appropriate column numbers from mpileup it should work.
  2. pileup or mpileup don't contain the read name. For regions of interest, use BamToBed to get bed file with reads. Then use intersectBed to merge reads.bed and above output.
ADD REPLY

Login before adding your answer.

Traffic: 2603 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6