Question

Consensus Fasta Sequence Of A Given Genomic Region For Individuals From 1000G

2

Entering edit mode

10.5 years ago

Pierre ▴ 130

Hey guys,

I am trying to retrieve consensus fasta sequences of a number of regions for specific individuals from 1000G, and further to make a multiple sequence alignment.

The pipeline I'm using for this task is not working. I would appreciate your help to let me understand what's wrong with it - and preferably how to speed up things.

Say, I have the following reference sequence (reference.fasta) for which I would like to retrieve the corresponding sequences for a number of 1000G individuals.

>hg19_knownGene_uc011dlx.1_7 range=chr6:29695734-29695883 5'pad=0 3'pad=0 strand=+ repeatMasking=none                                     GCCCAGCTCCTGAGTGTCTCTACCTCTCAAACAAGTATTCTCATCCAGGAG

Use samtools to get the sub-section of interest

  ./samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam 6:29695734-29695883  > subsection.sam

Convert SAM to BAM (SAM header is present: 84 sequences)

./samtools view -bS subection.sam > subsection.bam

Sort and index the bam file

./samtools sort subsection.bam subsection_sorted.bam
./samtools index subsection_sorted.bam

Use the reference fasta sequence and the bam reads to create a consensus sequence

./samtools mpileup -uf reference.fasta subsection_sorted.bam | ./bcftools/bcftools view -cg - | ./bcftools/vcfutils.pl vcf2fq > subsection_consensus.fastq

Use seqtk to convert fastq to fasta (optional: while marking sites with <20x coverage)

./seqtk seq subsection_consensus.fastq (-q20) > subsection_consensus.fa

Final output: a huge file (~60 MB) containing a sequence string of "n".

What I would like to have instead after a number of iteration across individuals would hopefully look like this:

>reference GCCCAGCTCCTGAGTGTCTCTACCTCTCAAACAAGTATTCTCATCCAGGAG
>NA00001 ......................A...........................G..
>NA00002 ......................................T..............

Thanks in advance for any help / suggestion.

1000genomes consensus samtools • 5.7k views

ADD COMMENT • link updated 8.5 years ago by greener ▴ 10 • written 10.5 years ago by Pierre ▴ 130

score 2 · Answer 1 · 2013-10-31

2

Entering edit mode

10.5 years ago

Pierre Lindenbaum 161k

I suppose you're mixing the chromosomes from hg19 (with a 'chr' prefix) and the 1KGenomes chromosomes without 'chr' prefix; That is why you're getting some 'NNNNNNNNNN': the tool cannot find the chromosomes. Use the reference sequence of 1000genomes instead of hg19. http://www.1000genomes.org/faq/which-reference-assembly-do-you-use

ADD COMMENT • link 10.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre, thank you for your reply. To see whether it is indeed a problem related to reference sequence, I used Grch37 reference and tried to retrieve the first 10 bases of the following transcript chr2:72356367-72356377: >ENSG00000003137|ENST00000001146|ENSP00000001146|2|72356367|72375167 ATGCTCTTTG However, I ended up with an even larger fasta file containing only non-bases. Any idea why this is not working?

ADD REPLY • link 10.5 years ago by Pierre ▴ 130

0

Entering edit mode

I am very likely pointing to the wrong data file, however, cannot really figure out the mistake. Since I'm only interested in the exome, I thought I might try this instead: samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/exome_alignment/HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam 2:72356367-72356377 This did not work.

ADD REPLY • link 10.5 years ago by Pierre ▴ 130

Ram · Answer 2 · 2015-11-11

I am trying to do the same thing. I would like to generate a fasta file on a predefined region chr7:128575994-128592088 from the 1000 genomes project. Ideally I'd like to have the output of that region from each individual. It would be helpful to know if the above approach was every successful (and how they got it to work) or if folks could provide me tips on how to complete this. Any help or suggestions are appreciated.

Thanks