Is There A Tool To Extract The Reference Genomic Sequence From A Given Coordinate?
2
0
Entering edit mode
10.7 years ago
bibb77 ▴ 90

Hello everyone, this is my problem:

I have a list of SNPs coordinates in this format (from chr 1 to chr Y, ~580k SNPs), they are under build GRCh37:

chr1:108681808
chr1:109440678
chr1:109479801
chr1:110655430
chr1:11193226
chr1:113933669
chr1:115258741
chr1:115527488
...

And I have the Reference Genome build37 in .fa format

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNN....gctttatacaatctat
ttgttactttttattctattttgcatttt
gttcctttgcctgaataattcactttggt
ctgcaatggctaattgcaatagattt...

Is there tool to obtain the sequence corresponding to each given coordinate?

chr1:108681808  A
chr1:109440678  C
chr1:109479801  T
chr1:110655430  G
chr1:11193226    A
chr1:113933669  T
chr1:115258741  C
chr1:115527488  C

Hope you can help me

reference vcf coordinates • 4.6k views
ADD COMMENT
1
Entering edit mode
10.7 years ago
Gabriel R. ★ 2.9k

You have a relatively small number of positions, I would just do

  1. put all your fastas into one, samtools faidx your reference
  2. do a for loop in bash where you replace chr1:108681808 to chr1:108681808-108681808, a quick awk should wor, see below:
  3. for each, do a faidx on the reference:

    for i in cat file.pos |awk 'BEGIN{FS=":"}{print $1":"$2"-$2}'

    do

    echo -np $i"\t";

    samtools faidx reference.fa $i;

    done

ADD COMMENT
0
Entering edit mode
10.7 years ago
SRKR ▴ 180

What is the genome size that you are working with... I mean the sequence in the .fa file. If it's reasonably small. I can create an online tool for the purpose.

ADD COMMENT

Login before adding your answer.

Traffic: 2442 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6