Remove duplicates from Ensembl sequences?
1
0
Entering edit mode
7.3 years ago
najibveto ▴ 110

Hello,

I am trying to retrieve some genes from the Ensembl database using the BioMart tool http://www.ensembl.org/biomart/martview/a5ef51e4e51a2077dbe7e13e2a015f6a

I got the genes, but some of them are duplicated and some have no sequence.

Can someone tell me how to remove the duplicates and the genes without a sequence? Or how to get a unique sequence for each gene using the BioMart tool?

Thanks for your help.

sequences Ensembl • 2.4k views

Can you provide more detail about how you got to this point? Something in the way you set up the query may be causing this. Genes are often represented by multiple splice variants (transcripts), so a gene may appear more than once in the output. You should take that into account.

Thanks for your kind reply. I am trying to retrieve the 3' UTRs of genes involved in the RIG-I-like pathway in zebrafish.

Can you help me use the BioMart tool to get a unique sequence, that is, just one sequence for each gene ID? Thanks for your help.

Have you tried using the gene names (ddx58, ifih1, mavs, dhx58, etc., derived from your list) and selecting "input external reference ID list" as the filter? Then, in attributes --> sequence, select one of the gene options to get the full-length representation.

Thanks, but I got the same results: duplicated sequences, or gene names without a sequence, as shown in the picture. [image]

I did as you said. [image] [image]

On the final page, where you download the results, there is a check box that says "Unique results only". Check that and then download the result file.

I already tried that, but it gave me the same problem.

If the sequences are identical you could use dedupe.sh from BBMap to remove them. It looks like the gene names are identical (hard to tell from the pictures) but there are some numbers after the name that look different so they must represent different entries. You could try leaving just gene names in your output and unchecking all other identifiers.
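
If you would rather not install BBMap, the same idea can be roughed out in a few lines of Python. This is only a sketch, not what dedupe.sh does internally. It assumes the download is a plain FASTA file, that entries without a sequence carry the placeholder text "Sequence unavailable", and that headers look like `gene|transcript`; the IDs in the example are made up. Note that it compares sequences only, so two different genes with an identical sequence would also be collapsed into one record.

```python
from typing import Iterator


def read_fasta(text: str) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_empty_and_duplicates(text: str) -> list[tuple[str, str]]:
    """Keep the first record for each distinct sequence; skip empty entries."""
    seen = set()
    kept = []
    for header, seq in read_fasta(text):
        # Assumption: missing sequences are exported as this placeholder text.
        if not seq or seq.lower() == "sequence unavailable":
            continue
        key = seq.upper()  # compare case-insensitively
        if key in seen:
            continue
        seen.add(key)
        kept.append((header, seq))
    return kept


# Hypothetical records, for illustration only.
example = """>geneA|tx1
ACGT
>geneA|tx2
ACGT
>geneA|tx3
Sequence unavailable
>geneB|tx4
GGCC
"""

for header, seq in drop_empty_and_duplicates(example):
    print(f">{header}\n{seq}")
```

With the example above, only the `geneA|tx1` and `geneB|tx4` records are printed.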

Thanks for your help. I will do as you suggested.

7.2 years ago
Emily 23k

They're different transcripts of the same gene. Transcripts of a gene may share the same 3' UTR, have different 3' UTRs, or have no 3' UTR at all. Here's an example gene where some transcripts have identical UTRs and some have different ones. You didn't do anything wrong, nor is BioMart going wrong. This is just the data.

I suggest putting the transcript ID in the headers so that you can differentiate between them.
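
If you still need exactly one sequence per gene after adding the transcript ID, you have to pick a rule yourself, because there is no single "correct" transcript. As a rough sketch, the Python below keeps the longest 3' UTR for each gene. The longest-UTR rule is just one arbitrary choice, and the code assumes headers of the form `gene_id|transcript_id` and the placeholder text "Sequence unavailable" for transcripts without a UTR; the IDs in the example are made up.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def longest_utr_per_gene(text, sep="|"):
    """Map each gene ID to the (header, sequence) of its longest sequence."""
    best = {}
    for header, seq in parse_fasta(text):
        # Assumption: transcripts without a UTR carry this placeholder text.
        if not seq or seq.lower() == "sequence unavailable":
            continue
        # Assumption: the gene ID is the first field of the header.
        gene_id = header.split(sep)[0]
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


# Hypothetical records, for illustration only.
example = """>geneA|tx1
ACGT
>geneA|tx2
ACGTACGT
>geneA|tx3
Sequence unavailable
>geneB|tx4
GG
"""

for gene_id, (header, seq) in longest_utr_per_gene(example).items():
    print(f">{header}\n{seq}")
```

Because the full header is kept, you can still see which transcript each chosen sequence came from.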

I understand now. Thanks for your explanation, and sorry for my late reply.
