Biostar Beta. Not for public use.
Consensus Fasta Sequence Of A Given Genomic Region For Individuals From 1000G
2
Entering edit mode
2.1 years ago
Pierre • 120
Spain

Hey guys,

I am trying to retrieve consensus fasta sequences of a number of regions for specific individuals from 1000G, and further to make a multiple sequence alignment.

The pipeline I'm using for this task is not working. I would appreciate your help to let me understand what's wrong with it - and preferably how to speed up things.

Say, I have the following reference sequence ( _reference.fasta_ ) for which I would like to retrieve the corresponding sequences for a number of 1000G individuals.

>hg19_knownGene_uc011dlx.1_7 range=chr6:29695734-29695883 5'pad=0 3'pad=0 strand=+ repeatMasking=none                                     GCCCAGCTCCTGAGTGTCTCTACCTCTCAAACAAGTATTCTCATCCAGGAG


Use samtools to get the sub-section of interest

  ./samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam 6:29695734-29695883  > subsection.sam


Convert SAM to BAM ( _SAM header is present: 84 sequences_ )

./samtools view -bS subection.sam > subsection.bam


Sort and index the bam file

./samtools sort subsection.bam subsection_sorted.bam
./samtools index subsection_sorted.bam


Use the reference fasta sequence and the bam reads to create a consensus sequence

./samtools mpileup -uf reference.fasta subsection_sorted.bam | ./bcftools/bcftools view -cg - | ./bcftools/vcfutils.pl vcf2fq > subsection_consensus.fastq


Use seqtk to convert fastq to fasta ( _optional: while marking sites with <20x coverage_)

./seqtk seq subsection_consensus.fastq (-q20) > subsection_consensus.fa


Final output: a huge file (~60 MB) containing a sequence string of "n".

What I would like to have instead after a number of iteration across individuals would hopefully look like this:

>reference GCCCAGCTCCTGAGTGTCTCTACCTCTCAAACAAGTATTCTCATCCAGGAG
>NA00001 ......................A...........................G..
>NA00002 ......................................T..............


Thanks in advance for any help / suggestion.

2
Entering edit mode
19 months ago
France/Nantes/Institut du Thorax - INSE…

I suppose you're mixing the chromosomes from hg19 (with a 'chr' prefix) and the 1KGenomes chromosomes without 'chr' prefix; That is why you're getting some 'NNNNNNNNNN': the tool cannot find the chromosomes. Use the reference sequence of 1000genomes instead of hg19. http://www.1000genomes.org/faq/which-reference-assembly-do-you-use

0
Entering edit mode

Hi Pierre, thank you for your reply. To see whether it is indeed a problem related to reference sequence, I used Grch37 reference and tried to retrieve the first 10 bases of the following transcript chr2:72356367-72356377: >ENSG00000003137|ENST00000001146|ENSP00000001146|2|72356367|72375167 ATGCTCTTTG However, I ended up with an even larger fasta file containing only non-bases. Any idea why this is not working?

0
Entering edit mode

I am very likely pointing to the wrong data file, however, cannot really figure out the mistake. Since I'm only interested in the exome, I thought I might try this instead: samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/exome_alignment/HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam 2:72356367-72356377 This did not work.

0
Entering edit mode
2.7 years ago
greener • 10
United States

I am trying to do the same thing. I would like to generate a fasta file on a predefined region chr7:128575994-128592088 from the 1000 genomes project. Ideally I'd like to have the output of that region from each individual. It would be helpful to know if the above approach was every successful (and how they got it to work) or if folks could provide me tips on how to complete this. Any help or suggestions are appreciated.

Thanks