How to count the length of fasta sequences?
5.0 years ago
Kumar ▴ 120

I know how to count the length of a particular FASTA sequence. However, I need the lengths of only those sequences whose headers are listed in a separate text file, and the lengths should be written to column 3 of a specified Excel file. Could you please help me do this? Thank you in advance.

fasta shell scripting python awk grep • 14k views

What have you tried? There are definitely solutions for this on the forum already.


Dear healey, the command below prints the length of each sequence in a FASTA file:

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' unique.fasta

But I do not know how to print the lengths of only the headers listed in another text file, or how to write those lengths to an Excel sheet.


You will have to provide sample data.


Suppose I have a multi-FASTA file, org1.fasta, as given below:

>seq1
ATGCTA

>seq2
GCTAGTT

>seq3
TAGC

I need the lengths of the following headers, which are listed in id.txt:

seq1
seq2

The resulting lengths should be written to the 3rd column (Total length) of another CSV file, org.csv, as shown below:

ID      hit_length    Total length
seq1    3             **6**
seq2    4             **7**

Just append |paste - - to your awk command.

unique.fasta

>seq1
ATGCTA

>seq2
GCTAGTT

>seq3
TAGC

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' unique.fasta |paste - -

stdout

>seq1   6
>seq2   7
>seq3   4

If you need to keep hit_length as the 2nd column, combine paste and cut to extract the lengths into their own file:

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' unique.fasta |paste - - |cut -f 2 > col3

col3

6
7
4

Combine the files together

paste file-with-ID-as-col1-and-hit_length-as-col2 col3 > final.result

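This assumes that the rows of your existing table are in the same order as the sequences in unique.fasta and that neither file has extra rows (such as a header line or sequences missing from the table); otherwise paste will pair the lengths with the wrong IDs.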
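
If the row order cannot be guaranteed, a lookup keyed on the ID sidesteps that assumption. The following is a minimal sketch, not a tested pipeline: it recreates the sample data from this thread in a temporary directory, then uses a single awk call that reads the FASTA file first and the table second, appending each ID's length as a third column. The file name table.tsv, the tab-delimited layout with one header row, and the NA placeholder for IDs missing from the FASTA file are all assumptions made for illustration.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Sample data from this thread.
printf '>seq1\nATGCTA\n\n>seq2\nGCTAGTT\n\n>seq3\nTAGC\n' > unique.fasta
# Assumed table layout: tab-separated, one header row.
# Rows are deliberately in a different order than the FASTA file.
printf 'ID\thit_length\nseq2\t4\nseq1\t3\n' > table.tsv

awk -F '\t' -v OFS='\t' '
    # First file (FASTA): map each sequence ID to its total length.
    NR == FNR {
        if (/^>/) {
            # ID = header text up to the first space or tab.
            split(substr($0, 2), h, /[ \t]/)
            id = h[1]
            len[id] = 0
        } else if (id != "") {
            len[id] += length($0)
        }
        next
    }
    # Second file (table): label the new column on the header row.
    FNR == 1 { print $0, "Total length"; next }
    # Data rows: append the looked-up length, or NA if the ID is unknown.
    { print $0, (($1 in len) ? len[$1] : "NA") }
' unique.fasta table.tsv > final.result

cat final.result
```

For a comma-separated table, changing -F and OFS to ',' should be enough, provided no field contains a quoted comma.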


What is hit_length?


ID and hit_length are existing columns in the CSV file. I need the length written to the 3rd column, as marked with the * symbols.

5.0 years ago
fhsantanna ▴ 610

I suggest using faslen from FAST tools. https://github.com/tlawrence3/FAST


Thank you fhsantanna,

faslen does the same thing as this command:

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' unique.fasta

But I need the lengths of only the listed FASTA headers.


You could run faslen, and then capture the headers with: grep '^>' file.fasta > headers.txt
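
The filtering can also be done inside awk itself, without a separate grep step. Here is a minimal sketch, assuming id.txt holds one ID per line without the leading > and that the ID is the first whitespace-delimited word of each header. The sample files are recreated in a temporary directory, and lengths.tsv is a hypothetical output name.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Sample data from this thread.
printf '>seq1\nATGCTA\n\n>seq2\nGCTAGTT\n\n>seq3\nTAGC\n' > org1.fasta
printf 'seq1\nseq2\n' > id.txt

awk '
    # First file (id.txt): remember the wanted IDs, skipping blank lines.
    NR == FNR { if (NF) want[$1]; next }
    # FASTA header: report the previous record if wanted, then start a new one.
    /^>/ {
        if (id in want) print id "\t" len
        id = substr($1, 2)
        len = 0
        next
    }
    # Sequence line: add its length to the running total.
    { len += length($0) }
    # Report the last record, which has no following header to trigger it.
    END { if (id in want) print id "\t" len }
' id.txt org1.fasta > lengths.tsv

cat lengths.tsv
```

The output is one tab-separated ID and length per line, which can then be pasted or joined into the existing table.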
