Calculate length of sequences from fasta file
4
1
Entering edit mode
6.4 years ago
KVC_bioinfo ▴ 590

Hello all,

I want to calculate the length of sequences from a huge FASTA file. I have read names of which I need to calculate the length from FASTA file. I know how to do it all reads using the following command.

awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}' file.fasta

However, I am not sure, how to do it for a specific number of reads. Could someone help me here?

fasta • 16k views
ADD COMMENT
0
Entering edit mode

it's not clear what is a "read" in this context.

ADD REPLY
0
Entering edit mode

Read: meaning sequence name

>read1

atgctatcat

>read2

atgctactagtcga

ADD REPLY
0
Entering edit mode

However, I am not sure, how to do it for a specific number of reads

just use head ?

ADD REPLY
0
Entering edit mode

The simple way is like ./seqkit fx2tab -l file.fasta > file_seq_with_length.txt

And further you can extract particular column using awk e.g. awk '{print $1, $3}' file_seq_with_length.txt > extrcted_length.txt Usually, your first column will contain Ids and the last column will contain sequence length.

Hoping it may help someone.... :)

ADD REPLY
0
Entering edit mode

OP wants % GC content in addition to length @ archana.bioinfo87. Probably following command with seqkit would work:

$ seqkit fx2tab -lg test.fa | grep -f ids.txt

test.fa being input fasta and ids.txt being text file with ids without ">".

ADD REPLY
2
Entering edit mode
6.4 years ago
Joe 21k

I would just subset the sequences you need by name using any one of dozens of tools that have been posted on here before, in to a different file, and then just calculate all the lengths in the new file.

ADD COMMENT
1
Entering edit mode

Thanks. It made my job easy

ADD REPLY
0
Entering edit mode

Be sure to select something as an answer if it helped solve your problem.

ADD REPLY
0
Entering edit mode

sure. Thank you. will do that

ADD REPLY
3
Entering edit mode
6.4 years ago

Use bioawk, it treats a FASTA/FASTQ record a single awk record, that way you can count the sequences while computing their cumulative length

https://github.com/lh3/bioawk

then counting the cumulative size of the first ten sequences would be

cat reads.fq | bioawk -c fastx 'NR < 11 { size += length($seq) } END { print size }'
ADD COMMENT
3
Entering edit mode
6.4 years ago

output:

$ seqtk subseq test.fa ids.txt | infoseq stdin -only -name -length -pgc
Display basic information about sequences
Name           Length %GC    
read2          14     42.86  
read3          6      50.00

Input : - huge fasta file: test.fa

$ cat test.fa 
>read1
atgctatcat
>read2
atgctactagtcga
>read3
atgctg
>read4
atgc

Sequences of interest (ids.txt)

$ cat ids.txt 
read3
read2

BTw, what do you mean by "specific number of reads" ? EMBOSS tools doesn't recognize n esp when calculating gc content. However it does fairly good job in counting characters.

ADD COMMENT
3
Entering edit mode
5.8 years ago

Another solution (just thought to post it since I did it now):

awk -F "" '/^>/{ if(NR!=1) {print len}; printf $0"\t"; len=0}; \
  !/^>/{len+=NF}' \
  gencode.v28.transcripts.fa |\
  head -35

>ENST00000456328.2|ENSG00000223972.5|...|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|    1657
>ENST00000450305.2|ENSG00000223972.5|...|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|   632
>ENST00000488147.1|ENSG00000227232.5|...RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|    1351
>ENST00000619216.1|ENSG00000278267.1|...|MIR6859-1-201|MIR6859-1|68|miRNA|  68
>ENST00000473358.1|ENSG00000243485.5|...|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA| 712
>ENST00000469289.1|ENSG00000243485.5|...|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA| 535
>ENST00000607096.1|ENSG00000284332.1|...|MIR1302-2-201|MIR1302-2|138|miRNA| 138
>ENST00000417324.1|ENSG00000237613.2|...|RP11-34P13.4-001|FAM138A|1187|lincRNA| 1187
>ENST00000461467.1|ENSG00000237613.2|...|RP11-34P13.4-002|FAM138A|590|lincRNA|  590
>ENST00000606857.1|ENSG00000268020.3|...|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene|   840
>ENST00000642116.1|ENSG00000240361.2|...|RP11-34P13.6-002|OR4G11P|1414|processed_transcript|    1414
>ENST00000492842.2|ENSG00000240361.2|...|RP11-34P13.6-001|OR4G11P|939|transcribed_unprocessed_pseudogene|   939
>ENST00000641515.2|ENSG00000186092.6|...|RP11-34P13.5-001|OR4F5|2618|protein_coding|    2618
>ENST00000335137.4|ENSG00000186092.6|...|OR4F5-201|OR4F5|1054|protein_coding|   1054
>ENST00000466430.5|ENSG00000238009.6|...|RP11-34P13.7-001|RP11-34P13.7|2748|lincRNA|    2748
>ENST00000477740.5|ENSG00000238009.6|...|RP11-34P13.7-003|RP11-34P13.7|491|lincRNA| 491
>ENST00000471248.1|ENSG00000238009.6|...|RP11-34P13.7-002|RP11-34P13.7|629|lincRNA| 629
>ENST00000610542.1|ENSG00000238009.6|...|AL627309.1-205|RP11-34P13.7|723|lincRNA|   723
>ENST00000453576.2|ENSG00000238009.6|...|RP11-34P13.7-004|RP11-34P13.7|336|lincRNA| 336
>ENST00000495576.1|ENSG00000239945.1|...|RP11-34P13.8-001|RP11-34P13.8|1319|lincRNA|    1319
>ENST00000442987.3|ENSG00000233750.3|...|RP11-34P13.10-001|RP11-34P13.10|3812|processed_pseudogene| 3812
>ENST00000494149.2|ENSG00000268903.1|...|RP11-34P13.15-001|RP11-34P13.15|755|processed_pseudogene|  755
>ENST00000595919.1|ENSG00000269981.1|...|RP11-34P13.16-001|RP11-34P13.16|284|processed_pseudogene|  284
ADD COMMENT

Login before adding your answer.

Traffic: 1801 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6