How to resolve this 1-based coordinates confusion
2
0
Entering edit mode
6.6 years ago

Assalam o alaikum everyone,

I have fetched a CDS sequence from the whole genome sequence of dog, which was downloaded from NCBI. The CDS sequence comes in parts, as shown below:

$ cat FABP5_CDS

>chr29:28363101-28363137
TGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG
>chr29:28491275-28491447
AGTGGGAATGGCTCTGCGAAAGGTGGGTGCAATGGCCAAACCAGATTGTATCATCTCTTCTGACGGCAAAAACCTCACCATAAAA
>chr29:28491806-28491907
CTGTCTGCAACTTCACAGACGGCGCATTGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACAAGAAAGTTGGAAGATGGGAAATTGGTGGTG
>chr29:28492441-28492494
AATGCGTCATGAACAATGTCACCTGTACGCGGATCTATGAAAAAGTAGAGTAA

I then processed and concatenated these parts and found that the CDS does not start with the start codon (ATG). This is due to the 0-based versus 1-based coordinate systems (BED and my BAM file is 1-based).

I have to add 1 base at the start of my CDS part, e.g.

Before adding one base:

>chr29:28363101-28363137
TGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG
>chr29:28491275-28491447
AGTGGGAATGGCTCTGCGAAAGGTGGGTGCAATGGCCAAACCAGATTGTATCATCTCTTCTGACGGCAAAAACCTCACCATAAAA
>chr29:28491806-28491907
CTGTCTGCAACTTCACAGACGGCGCATTGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACAAGAAAGTTGGAAGATGGGAAATTGGTGGTG
>chr29:28492441-28492494
AATGCGTCATGAACAATGTCACCTGTACGCGGATCTATGAAAAAGTAGAGTAA

After adding one base (A), the first part now starts with ATG:

>chr29:28363101-28363137
ATGACTGTGTCAGTCCAGGTTCTCTGGGGGACTGAGG
>chr29:28491275-28491447
AGTGGGAATGGCTCTGCGAAAGGTGGGTGCAATGGCCAAACCAGATTGTATCATCTCTTCTGACGGCAAAAACCTCACCATAAAA
>chr29:28491806-28491907
CTGTCTGCAACTTCACAGACGGCGCATTGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACAAGAAAGTTGGAAGATGGGAAATTGGTGGTG
>chr29:28492441-28492494
AATGCGTCATGAACAATGTCACCTGTACGCGGATCTATGAAAAAGTAGAGTAA

My question is: should I add one base at the start of each CDS part, or at the start of the first CDS part only? I'm very confused. Any idea how to fix it?

1-based 0-based construct CDS sequence • 1.8k views
ADD COMMENT
1
Entering edit mode

What genome build is this from?

According to Ensembl FABP5 is a pseudo-gene in Dog (CanFam v.3.1) with 3 exons

..........ctgggcttgctacagcgctgatcatagaatcctcttcaattccagctgga

ATGCTGTGTCAGGCACTTCACAGATTTGGTCAAAAGCTGGTACGCAGACGTACATTGAAG
CAAGATGTGACCCAGATCAGATATTTGAACACACTGGATTTAGTGACCCTGGGTGTGGAC
CACACAGTGGGTATAGGTGTATATGTCCTGGCTGGGGAGGTGATCAGTAATCAAGCAGGA
CCTTCGATTGTGATCTGCTTTTTGGTGGCTGGCCTAGCCTCGTTGTTGGCTGGGCTGTGC
TATGCAGAGTTGAGTATCCGGATTCCTCATGCTGACTCTGCATATGTCTACACCTATGTC
ACTGTAGGTGAACTTGGTGCTTTTGTCACTGGCTGGAACCTCCTCCTCTTCCTTGTTGCT
GATGGAGTTGTGTTGGGTTGGGTGTGGATGTTAATTTTTGACAACCTGCGTGGGGACCAG
ATATCTGAGACCCTGACTGAGAACATTTCATCATATGTTTCCCGTGTCTTTGAAAAATAT
CTAGGCTTCTTTGTTACGTGTTTTGTATTCTTCCTCACTGATTTCTGGTATCTGTGGGTT
TTTGAGTGTTCCCAGATTTCCAAATGGTTCACATTGGTTAAAGTTTTCTTTCTCAGTTTT
GTCATCATCTCTGGCATCATTAAGGGATCTGCGCAACTGGAAGCTCACAGAAGAGGACTA
CGTGAAGGCTGGACTCAATGACACCTCTAGTTGAGCCCTCTGGGCTCTGGAGGATTCATG
CCTTTTGGCTTCCAGGGGATTTTCCGTGGTGCAGCTACCTGCTTCTATGCTTTTGTTGGT
TTTGACAACATCGTGACCAGAGGTAAAGTAACCCAGAATCCCCAGCATTCTATCCCTATG
GGCATTGTGATTTCACTGTTCATCAGCTCTTTGTTGTATTTTGGTATCTCTGCAGCACTT
ACACTTATGGTGCCTTACTACCAGCTTCGACCTGGTAGCCCCTTGCCTGACGTATTTCTC
CATATTGGCTGGGCTCCTGCCTTCTATGTT                              

gtaacttttggatttttctgttttc..........aaatatgtgcatatgtgctttacag

AATAAAACCTCTGAATTAAAAAAAAAAATGGCCAAACCAGATTGTATCATCTCTTCTGAC
GGCAAAAACCTCACCATAAAAACTGAGAGCACTTTGAAAACAACACAGTTTTCGTGTAAT
CTGGGAGAGAAGTTTGAAGAAACTACAGCTGATGGCAGAAAAACTCAGACTGTCTGCAAC
TTCACAGACGGCGCATGGGTTCAACATCAGGAATGGGATGGGAAGGAAAGCACAATAACA
AGAAAGTTGGAAGATGGGAAATTGGTGGTGGAATGCGTCATGAACAATGTCACCTGTACG
CGGATCTATGAAAAA                                             

gtagagtaaaaattccatcatcatt..........gacctattttggcacattcacccag

GTTGTGGTCATCGTGATCATTTGTGTTATTGCAGCAGTCATGACATTCTTCTTTGGACTC
ACTTATCTTGTGGACCTCAGTGCAATTGGGTCCCTGACACCTCACTCTCTTGATGCTATT
TGTGTACTCATCCTCAAGTATCAGCCTGAGAAGAAGAATGAGTGA               

aatgaagcacaggtactggaggagaatgggcctatggcagagaagctgac..........
ADD REPLY
0
Entering edit mode

According to the information given at NCBI for the dog genome, FABP5 is not listed as a pseudogene.

ADD REPLY
0
Entering edit mode

Are you certain those are CDS features? They don't start with canonical start codons, nor do they all appear to have stop codons.

ADD REPLY
0
Entering edit mode

Yes, I'm certain about it. In the above example all the sequences are parts of a single CDS, and there is a stop codon (TAA) at the end of the last part.

ADD REPLY
1
Entering edit mode

Oh, it's a single CDS? That's an odd way to depict the sequence. Then, to answer your question, you should only add an A to the first part of the sequence, where the ATG would be.

ADD REPLY
0
Entering edit mode

Yes, it's a single CDS. I have fetched these sequences from the whole genome. Actually, I'm confused because of these parts. I have a coordinates file for extracting a whole CDS, like the one below, and this file format is 1-based.

chr29   28363101    28363137    .   .   .
chr29   28491275    28491447    .   .   .
chr29   28491806    28491907    .   .   .
chr29   28492441    28492494    .   .   .

My point is: why add a base for the first coordinate only, and not for all the parts?

ADD REPLY
1
Entering edit mode

Because if, as you say, each sequence is PART of the CDS, and not the CDS itself, genes start with an ATG. If the 0-based numbering affects the sequences afterwards too, you don't know what base needs adding so you can't just put an A in there. You have an additional problem: if they have all had their first base deleted, you won't know what to replace it with, and if your sequences aren't each a multiple of 3 long, there will be frameshifts in it too.
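The frame concern above can be tested on the joined sequence. The following is only a minimal sketch: the function name `check_cds` and the toy parts are invented for illustration and are not taken from the poster's data.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(parts):
    """Concatenate CDS parts and report basic reading-frame sanity checks."""
    cds = "".join(parts).upper()
    # In-frame codons that begin before the final three bases.
    upstream_codons = [cds[i:i + 3] for i in range(0, len(cds) - 3, 3)]
    return {
        "length": len(cds),
        "multiple_of_three": len(cds) % 3 == 0,
        "starts_with_atg": cds.startswith("ATG"),
        "ends_with_stop": cds[-3:] in STOP_CODONS,
        "internal_stops": sum(c in STOP_CODONS for c in upstream_codons),
    }


# Toy example (hypothetical sequences): an intact CDS in two parts,
# and the same CDS with the first base of the first part missing.
intact = check_cds(["ATGAAA", "CCCTAA"])
truncated = check_cds(["TGAAA", "CCCTAA"])
print(intact)
print(truncated)
```

Individual parts of a spliced CDS need not be multiples of three on their own, so the check is applied to the concatenation rather than to each part.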

ADD REPLY
0
Entering edit mode

Thank you for the reply.

you don't know what base needs adding so you can't just put an A in there

Actually, which base to add is not the problem, because we can find the correct base by changing the first coordinate, e.g.

the first coordinate 28491275 becomes 28491274, so by reducing it by one we can find the correct base. I have tested this for the ATG and it is always A, so I put A there.

But I'm not clear whether I should add one base for the other parts or not. Do you have any idea how I can test it?

ADD REPLY
0
Entering edit mode

I'm still not really seeing the problem - my apologies. Maybe I'm being really stupid.

I'm not sure I can really help you unless you know a priori whether or not the off-by-one error affects them all. It might be easy to fix depending on your dataset.

The last sequence in your example is around 100 kilobases away from the first sequence in the sample, so it seems pretty unlikely to me that they're part of the same CDS. I'm no eukaryote expert, but that seems like a lot even taking introns into account.

Do you have a fasta sequence of the whole, uninterrupted sequence we can see so that we can understand what these sequences represent?

ADD REPLY
0
Entering edit mode

I have generated a consensus FASTA from a BAM file. My BAM file is aligned against CanFam3.1, so I downloaded the CanFam3.1 annotation file from NCBI and used its coordinates for CDS extraction.

ADD REPLY
0
Entering edit mode

If this is a published genome, why not just download the gff or genbank and extract the CDSs from that as 1 continuous sequence?

ADD REPLY
0
Entering edit mode

No, this is not a published genome.

ADD REPLY
0
Entering edit mode

But you said you downloaded it from NCBI?

ADD REPLY
0
Entering edit mode

Oh, sorry for that :o The above example is not from a published genome.

But I have also tried it with the dog genome that is available at NCBI, and the published genome shows the same problem.

ADD REPLY
2
Entering edit mode
6.6 years ago
Joe 21k

If I understand the problem then I'm not sure you can easily resolve this without more information.

Your off-by-one issue presumably affects all the sequences, but we can't know that for sure unless you can find out unequivocally from the software authors, or you have the whole reference sequence to compare against.

If we assume it does, the first base of the middle sequences can be restored by taking the last base from the sequence preceding it. For the first sequence, however, you can't simply assume it needs an A adding. It's likely, but not all genes start with an ATG; there are other possible start codons. It looks like the last sequence still has its stop codon, so the last sequence is probably fine too.

This really doesn't seem like the best way to do this to me. You should really check it against the full sequence though.
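Checking against the full sequence, as suggested above, could look roughly like the sketch below. It assumes the reference is held as a plain Python string; the helper names and the toy reference are made up for illustration.

```python
def fetch_1based_closed(ref, start, end):
    """Return bases start..end where both ends are 1-based and inclusive."""
    return ref[start - 1:end]


def fetch_0based_half_open(ref, start, end):
    """Return bases from a 0-based start up to, but excluding, end."""
    return ref[start:end]


# Toy reference (hypothetical): "ATGAAA" occupies 1-based positions 3..8.
ref = "CCATGAAACC"

as_annotated = fetch_1based_closed(ref, 3, 8)     # "ATGAAA"
misread = fetch_0based_half_open(ref, 3, 8)       # "TGAAA"

# When 1-based coordinates are read as 0-based half-open, the interval
# loses exactly its first base, and that base sits at `start` in the reference.
assert as_annotated == ref[3 - 1] + misread
print(as_annotated, misread)
```

If every part in a real file shows this pattern when compared with the reference, the shift applies to all parts and not only to the first one.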

ADD COMMENT
0
Entering edit mode

Thank you so much, your answer is helpful.

ADD REPLY
1
Entering edit mode
6.6 years ago

BED and my BAM file is 1-based

BED and BAM are usually 0-based, half-open [start-1, end). I'd start there, as errors there can cause grief downstream.
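As a minimal sketch of that conversion, assuming the coordinates come from the start and end columns (4 and 5) of a standard nine-column GFF3 line: the function name `gff3_to_bed6` and the example source, strand and attribute values are hypothetical.

```python
def gff3_to_bed6(gff_line):
    """Convert one GFF3 feature line (1-based, closed) to a BED6 line
    (0-based, half-open). Only the start changes; the end stays the same."""
    fields = gff_line.rstrip("\n").split("\t")
    seqid, _source, feature, start, end, score, strand = fields[:7]
    bed_start = int(start) - 1  # shift the start down by one
    return "\t".join([seqid, str(bed_start), end, feature, score, strand])


# Hypothetical GFF3 line built around the first interval in the question;
# the source, strand, phase and attribute values are placeholders.
line = "chr29\texample\tCDS\t28363101\t28363137\t.\t+\t0\tID=cds0"
print(gff3_to_bed6(line))
```

With the start shifted this way, the BED interval length (end minus start) equals the number of bases in the 1-based closed interval.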

ADD COMMENT
0
Entering edit mode

Yes, you are right. I guess the problem is in my file format.

Actually, I took columns 4 and 5 from the GFF3 (annotation) file, made a BED6 file from them, and then used bedtools getfasta to get the FASTA sequence.

This is the wrong approach. I should have converted the GFF to BED and then used that for sequence fetching. After testing it on multiple genes I will paste the results here.

ADD REPLY
