Batch download of hg19 sequences in bed file
2
1
Entering edit mode
7.3 years ago
bioguy24 ▴ 230

I am trying to identify runs of homopolymers in sequences of a bed and have a perl script that works for the desired output below. There are ~55,000 lines in the custom.bed, is there a way to download all the sequences in the bed in the desired output format? UCSC table browser will give the output in the correct format however there is a limit of 1,000 entries per query. The below seems close but the output is formatted incorrect. Thank you :).

custom.bed
chr1    948953  948956  chr1:948953-948956  .   ISG15

BED=custom.bed
for chr in `seq 1 22` X Y
do
wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz | gunzip -c >> hg19.fa
done

 fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta

current output:

chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

desired output of each line for fasta sequence:

>hg19_refGene_NM_002335_0 range=chr11:68080173-68080283 5'pad=10 3'pad=10 strand=+ repeatMasking=none
 gccggacaacATGGAGGCAGCGCCGCCCGGGCCGCCGTGGCCGCTGCTGC
 TGCTGCTGCTGCTGCTGCTGGCGCTGTGCGGCTGCCCGGCCCCCGCCGCG
 Ggtaggtgggc
ngs FASTA Bed • 3.0k views
ADD COMMENT
1
Entering edit mode

Are the fields in custom.bed tab-separated? You are not going to get the header you want (hg19_refGene_NM_002335_0 range=chr11:68080173-68080283 5'pad=10 3'pad=10 strand=+ repeatMasking=none, unless this is a dummy example) from the command you are using.

ADD REPLY
5
Entering edit mode
7.3 years ago
genecats.ucsc ▴ 580

Hello, cmccabe.

Thank you for your question about getting FASTA sequences from the UCSC Genome Browser.

Are you uploading your "custom.bed" as a custom track and then attempting to get the FASTA sequence from the Table Browser? You are correct that there is a limit of 1000 regions if you are using the Table Browser's "define regions" option, but for custom tracks, there is no such limit.

Here are some basic steps that you could use to get the sequence you're interested in:

  1. Upload your "custom.bed" as a custom track: https://genome.ucsc.edu/cgi-bin/hgCustom?db=hg19
  2. After uploading, navigate to the Table Browser:https://genome.ucsc.edu/cgi-bin/hgTables.
  3. Make the following selections: clade: Mammal genome: Human assembly: Feb. 2009 (GRCh37/hg19) group: Custom Tracks track: My Custom Track table: myCustomTrack output: sequence output file: enter a file name to save your results to a file, or leave blank to display results in your browser

  4. Click "get output".

  5. Select your sequence output options.
  6. Click "get sequence".

You should end up with a FASTA with names similar to what you are looking for in your "desired output".

If you have any follow up questions, it would be helpful if you could post them to our Google Groups forum: https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome, that way our whole team can see the question and help with an answer.

Thanks,

Matthew from the UCSC Genome Browser

ADD COMMENT
0
Entering edit mode

Thank you all very much :)

ADD REPLY
2
Entering edit mode
7.3 years ago
Chun-Jie Liu ▴ 280

From your script, you want to download all hg19 fasta, then use bedtools to extract from fasta. Your script is not correct.

for chr in `seq 1 22` X Y M
do
wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${chr}.fa.gz &
done
wait

gunzip -c *fa.gz > hg19.fasta

fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta
ADD COMMENT

Login before adding your answer.

Traffic: 2946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6