Question

interpreting fasta header

0

5.3 years ago

genya35 ▴ 40

Hello, I have a text file with thousands of unique sequences in fasta format. Each read has a header in the following format:

122391_Tcount2352_Acount2352_Bcount0_length293

It's obvious that 'length' represents the length of the read, but the other numbers are not clear. I do not know which tool was used to generate the file, but blastn was used at some point in the pipeline. I'm curious whether anyone here has encountered this header format before and can tell me which part of the sequence header represents the count of reads.

Thanks for your help in advance,

Lena

alignment • 1.4k views

1

Hi Lena,

Can you tell us the tool that provided those fasta headers for you? That might help us know what "Tcount", "Acount" and "Bcount" mean.

Thanks!

ADD REPLY • link 5.3 years ago by Josh Herr 5.8k

0

Identifying possible tools from the header style/format is the whole question...

ADD REPLY • link 5.3 years ago by Joe 21k

0

Lena,

Take a few of the sequences and run them through blastn or blastx. That may make it clearer which organism you are dealing with. Then check NCBI to see who sequenced it. You may even find some articles describing it. Good luck!
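To pull out a handful of records for such a search, something like the following could work. It is only a minimal sketch using the Python standard library; the function names `read_fasta` and `first_records` and the example records are made up for illustration and are not taken from the original file.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def first_records(lines, n=5):
    """Return the first n records as a list of (header, sequence) tuples."""
    return list(islice(read_fasta(lines), n))


# Hypothetical records in the header style from the question;
# the identifiers, counts, and sequences are placeholders.
example = [
    ">1_Tcount5_Acount3_Bcount2_length12",
    "ACGTACGT",
    "TTGA",
    ">2_Tcount10_Acount4_Bcount6_length8",
    "GGCCAATT",
]

for header, seq in first_records(example, n=2):
    print(f">{header}\n{seq}")
```

For a real file, passing an open file handle (for example `open("reads.fasta")`, where the file name is an assumption) in place of `example` should behave the same way, since the function only iterates over lines.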

ADD REPLY • link updated 5.3 years ago by Ram 43k • written 5.3 years ago by natasha.sernova ★ 4.0k

3

How does this help with the question about the information in the header?

ADD REPLY • link 5.3 years ago by ATpoint 81k

1

Lena said she had thousands of unique sequences.

If the data are published and the source is known, one option is simply to ask the authors.

It may or may not help, but any additional information is valuable.

ADD REPLY • link 5.3 years ago by natasha.sernova ★ 4.0k

0

Can you provide a little more background? Where did you get the file? Did a co-worker or collaborator pass it to you? If so, ask them. Did you download it from a site, database, or paper? Then please tell us where from.

My guess is that this comes from some unpublished internal or personal pipeline, and your only hope of getting a conclusive answer is to ask the person who created it.

Just guessing wildly, because guessing is free: I think the first number is the transcript identifier, Tcount is the count of reads for sample T, Acount is the count of reads for sample A, Bcount is the count of reads for sample B, and length is the length of the transcript.
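If that guess is roughly right, the header can be split into its fields with a small regular expression. This is only a sketch: the field names (`seq_id`, `tcount`, and so on) and the function `parse_header` are made up here, and the meaning of each field remains unconfirmed.

```python
import re

# Pattern for headers like 122391_Tcount2352_Acount2352_Bcount0_length293.
# The group names reflect the guess above, not any tool's documentation.
HEADER_RE = re.compile(
    r"^>?(?P<seq_id>\d+)"
    r"_Tcount(?P<tcount>\d+)"
    r"_Acount(?P<acount>\d+)"
    r"_Bcount(?P<bcount>\d+)"
    r"_length(?P<length>\d+)$"
)


def parse_header(header):
    """Return the header's numeric fields as a dict of ints.

    Raises ValueError if the header does not match the expected pattern.
    """
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unexpected header format: {header!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


fields = parse_header("122391_Tcount2352_Acount2352_Bcount0_length293")
print(fields)

# Does Tcount equal Acount + Bcount for this header?
print(fields["tcount"] == fields["acount"] + fields["bcount"])
```

It may be worth running that last check over every header in the file. In the single example from the question, Tcount equals Acount + Bcount (2352 = 2352 + 0); if that holds throughout, it could hint that T is a total and not a separate sample, but that is just as much a guess as the rest.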

ADD REPLY • link 5.3 years ago by h.mon 35k