struggle to get fasta files from ucsc goldenPath
0
0
Entering edit mode
29 days ago
Lila M ★ 1.2k

Hi all, I am very interested in get fasta seq (mrna) from a given genes IDs. My first approach has been to download from here https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/ the refMrna.fa.gz file and then use bedtools getfasta -rna to retrieve it:

e.g: bedtools getfasta -fi refMrna.fa -rna -bed test.bed -fo test.fa.out

(bed file is eg: chr8 9999999 9999998)

However, after getting the error: 'WARNING. chromosome (chr8) was not found in the FASTA file. Skipping.' I decided to explore the ref mrna file and I realised the gene IDs are in format refseq format:

e.g: grep -o -E "^>\w+" refMrna.fa | tr -d ">" > NM_001383, NR_164501 ..

Now I am wondering if there is a smartest way to get the mrna fasta seq or if I should transform the gene ID using the UCSC tool (RefSeq all) which provides an output with lots of refseqs for the given gene and use that file to be passed as a bed file.

Any feedbacks on this issue?

Thanks!

ucsc getfasta fasta • 505 views
ADD COMMENT
1
Entering edit mode

If your intervals are for main chromosomes why are you using the mRNA file instead of the entire genome? If you have the gene ID's you could also use the table browser to get the sequence.

ADD REPLY
0
Entering edit mode

It is because I want to pull the mRNA fasta files and not DNA (which I guess is the case if I pull it from the entire genome)

ADD REPLY
0
Entering edit mode

mRNA file is not going to have the same co-ordinates as the main chromosomes. Can you clarify what coordinates are in your BED file?

ADD REPLY
0
Entering edit mode

So this is my approach: I want to interrogate gene PVT1, I know the coordinates for it is chr8:127794532-128101252 In UCSC table I selected the following options:

group: mRNA and EST
track: Human mRNAs
table: all_mrna

This prints out a big file, but I can select the coordinates of interest (my BED file <test.bed> ), which are:

chr8    127794532   128101253 LF385466
chr8    127794532   128101253 LF385467
chr8    127794532   128101253 MA621043
chr8    127794532   128101253 MA621044

Using that 'portion' of the file I can then do:

bedtools getfasta -fi mrna.fa -rna -bed test.bed  -fo test.fa.out

Is this approach not correct? What is your suggestion then?

ADD REPLY
1
Entering edit mode

If coordinates in your BED file refer to chromosomal locations then you need to use the whole genome file and get those sections by the method you noted above.

That said you should be able to get the sequence using table browser directly.

ADD REPLY
0
Entering edit mode

Could you please elaborate how to retrieve the sequences directly from the table browser? Thanks!

ADD REPLY
0
Entering edit mode

Can you select the output format as sequence instead of BED?

ADD REPLY
0
Entering edit mode

Of course ... But as 'expected' the range range=chr8:127794533-128101253 in the output sequence file belongs to: LF385466, LF385467, MA621043 and MA621044, which exactly the same sequences as my approach using the mrna.fa file.

ADD REPLY
0
Entering edit mode

Not sure if I am missing something but you want 4 copies of the same sequence with a different fasta header?

ADD REPLY
0
Entering edit mode

They are +ve and -ve strands, but identical two by two ... I can filter that out later on. This is probably a very naive question, but what the header (LF385466, LF385467 ...) means? I know the are the qnames, but why they can be identical sequences and different headers?

ADD REPLY
1
Entering edit mode

Looks like those appear to be "polycomb associated non-coding RNA". You probably know what they are.

https://www.ncbi.nlm.nih.gov/nuccore/LF385466.1/
https://www.ncbi.nlm.nih.gov/nuccore/MA621043.1/
https://www.ncbi.nlm.nih.gov/nuccore/MA621044/

ADD REPLY
0
Entering edit mode

Yes! I've just realised about it! Thank you so much for following this issue!

ADD REPLY

Login before adding your answer.

Traffic: 1574 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6