Question

Substituting Snps On Reference Genome Assemblies

1

Entering edit mode

10.9 years ago

Sakti ▴ 510

Hi again guys!

So this time with a new question. I have a list of variant calls (basically, SNPs) for a mouse genome sample. SNP positions are reported using the reference mouse genome sequence. What I would like to do is to take this list and substitute each reported SNP position on the corresponding reference mouse genome sequence as to obtain my own "genome" with the corresponding SNPs. However, I don't know how to do this task in an efficient manner just using my current programming skills.

I was thinking on storing each bp as part of an array, and then reading my SNPs file and then changing, but wouldn't storing a whole chromosome sequence be too much?

I'd like to know if you know of any other better way :)

Thanks!!

mouse genome snp perl • 8.7k views

ADD COMMENT • link updated 5.8 years ago by nataliagru1 ▴ 90 • written 10.9 years ago by Sakti ▴ 510

Ram · Answer 1 · 2013-05-20

This link provides the tool to create strain specific reference genome for effective mapping of the reads.

http://alleleseq.gersteinlab.org/tools.html

Download personnel genome constructor. This script will take vcf file and reference genome as input and will create enhanced reference genome as output. As this script is meant for human, two versions including maternal and paternal for each chromosome will be created. But as reference genome of mouse is an inbred organism, it has the same allele on both the chromosomes. so you can consider any copy either paternal or maternal. But make sure your vcf file has the homozygous variant tagged as 1/1 for GT tag and you should remove any heterozygous variants if any in the vcf file to skip the unnecessary complications.

I also have this script that I wrote a long ago. If you know very basic python you should be able to modify it to work for your case. It exactly does what you need. Here is the link: https://github.com/ashutoshkpandey/SimplePrograms/blob/master/psuedogenome_SNPs.py

Ram · Answer 2 · 2013-05-20

1

Entering edit mode

10.9 years ago

JC 13k

Loading a full chromosome as an array is never a good idea, you can load the full chromosome as a "string" and modify the positions with "substr", something like this:

#!/usr/bin/perl

my %pos = loadPositions($file); # returns a hash like this $pos{'chr1'}{121726} = 'G'
my %seq = loadSequences($fasta); # returns a hash like this $seq{'chr1'} = 'ATCGATCATCTACTA....';

while (my ($chr, $seq) = each %seq) {
    foreach my $pos (keys %{ $pos{$chr} }) {
        substr($seq, $pos - 1, 1) = $pos{$chr}{$pos}; # be careful with 0-based and 1-based coordinates
    }
    printFasta($seq);
}

In this case, I'm loading the full genome, each chromosome as a "string", but you can optimize the code to do it chromosome by chromosome.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.9 years ago by JC 13k

0

Entering edit mode

I think sorting the positions in reverse would be prudent, or indels will throw off your numbering. If you work from the back to the front, your numbering will stay intact.

ADD REPLY • link 10.9 years ago by swbarnes2 14k

0

Entering edit mode

Thank you very much JC!! Will give it a try, but where is the loadSequences function? Is that part of a library in perl?

ADD REPLY • link 10.9 years ago by Sakti ▴ 510

0

Entering edit mode

that was a simple example for use "substr" to change a position in a string, it is not the complete script

ADD REPLY • link 10.9 years ago by JC 13k

score 0 · Answer 3 · 2016-11-25

0

Entering edit mode

7.4 years ago

FatihSarigol ▴ 250

Anyone still coming to this page, if you have a fasta reference genome and a bam file that you want to turn into the reference file by changing SNP's and N's, you may try this one-liner using samtools, bcftools and vcfutils.pl (ps for beginners: both samtools and bcftools can be compiled in a computing cluster or in Linux, if so just add the locations of each before the software name; vcfutils is already a perl script from bcftools)

samtools mpileup -d8000 -q 20 -Q 10 -uf  REFERENCE.fasta Your_File.bam | bcftools call -c | vcfutils.pl vcf2fq  > OUTPUT.fastq

-d, --max-depth == -q, -min-MQ Minimum mapping quality for an alignment to be used == -Q, --min-BQ Minimum base quality for a base to be considered == (You can use different values of course, see http://www.htslib.org/doc/samtools.html)

Which generates a weird format that looks like fastq but isn't, so you can't convert it using a converter, but you can use the following sed command, which I wrote specific for this output:

sed -i -e '/^+/,/^\@/{/^+/!{/^\@/!d}}; /^+/ d; s/@/>/g' OUTPUT.fastq

In the end, make sure to compare your new fasta files to your reference to be sure that everything is fine.

EDIT BE CAREFUL WITH THE SED COMMAND IT MAY DELETE SOME OF YOUR READS IN DIFFERENT CASES OF QUALITY SCORING THAN I HAD

ADD COMMENT • link 6.2 years ago by FatihSarigol ▴ 250

0

Entering edit mode

After running mpileup and the sed command, the output.fastq can then be successfully converted to .fasta?

Thanks for touching on this process, it is very helpful for my work.

ADD REPLY • link 6.8 years ago by TrentGenomics ▴ 30

0

Entering edit mode

My version of samtools 1.4 doesn't have vcfutils.pl. It does have samtools.pl, which has pileup2fq, so you might want to try that.

You can make your own pileup2fa pretty easily, by editing the samtools.pl script. perl scripts are plain texts, so you can look at them easily.

Find the the code below, and change this

print "\@$chr\n"; &p2q_print_str($seq);
  print "+\n"; &p2q_print_str($qual);

To

print "\>$chr\n"; &p2q_print_str($seq);
  #print "+\n"; &p2q_print_str($qual);

ADD REPLY • link 6.8 years ago by swbarnes2 14k

0

Entering edit mode

Would 'bcftools consensus' be a similar function to the above commands?

ADD REPLY • link 6.7 years ago by TrentGenomics ▴ 30

0

Entering edit mode

michbrown, sure you can convert fastq to fasta afterwards, try this simple awk script for example:

awk 'NR%4==1{print ">"substr($0,2)}NR%4==2{print $0}' file.fastq > file.fasta

I never tried bcftools consensus, looks similar to bcftools -c for output of samtools mpileup, did you try consensus? I also see now on bcftools webpage this

"Users are now required to choose between the old samtools calling model (-c/--consensus-caller) and the new multiallelic calling model (-m/--multiallelic-caller). The multiallelic calling model is recommended for most tasks."

so one should give that a try, too.

ADD REPLY • link 6.7 years ago by FatihSarigol ▴ 250

0

Entering edit mode

swbarnes2, looks like a good solution, I also like changing scripts of softwares written in python sometimes. vcfutils.pl comes with bcftools in 2 versions I have used, check the installation folders: (( bcftools-1.4/vcfutils.pl )) (( bcftools-1.3/bin/vcfutils.pl )) (but I did hear that samtools also had a similar script before)

ADD REPLY • link 6.7 years ago by FatihSarigol ▴ 250

0

Entering edit mode

why did you set -d to 8000 dont you want a minimum reads of 10 or max 250? 8000 seems a bit excessive right?

I am still learning what is the best way to execute mpileup. So I am wondering why you set read depth to 8000. Thank you

ADD REPLY • link 5.8 years ago by nataliagru1 ▴ 90

0

Entering edit mode

Hi Natalia, I guess this is a question for my code up there. That 8000 was the default value of maximum depth selected by samtools mpileup (I just checked it still seems to be the same in the manual) I guess I had written that to indicate an available option that is sometimes important to change, no reason behind setting it to default. It is "max per-file depth" that "avoids excessive memory usage"

EDIT update for samtools 1.9 version, quoting from github page: "Samtools mpileup now handles the '-d' max_depth option differently. There is no longer an enforced minimum, and '-d 0' is interpreted as limitless (no maximum - warning this may be slow). The default per-file depth is now 8000, which matches the value mpileup used to use when processing a single sample. To get the previous default behaviour use the higher of 8000 divided by the number of samples across all input files, or 250."

ADD REPLY • link 5.7 years ago by FatihSarigol ▴ 250

1

Entering edit mode

Thank you so much!!!

Greatly appreciated!!

ADD REPLY • link 5.7 years ago by nataliagru1 ▴ 90

0

Entering edit mode

Do not add answers unless you're answering the top level question. Use the Add Comment/Add Reply options instead. Read posts under https://www.biostars.org/t/how-to to understand how to use the forum.

ADD REPLY • link 5.7 years ago by Ram 43k