Get chromosome sizes from fasta file
4
6
Entering edit mode
8.2 years ago
rioualen ▴ 710

Hello,

I'm wondering whether there is a program that could calculate chromosome sizes from any fasta file? The idea is to generate a tab file like the one expected in bedtools genomecov for example.

I know there's the fetchChromSize program from UCSC, but not all genomes are available over there (I need TAIR10 for instance). I've read this topic already.

I would like a tool that can deal with any genome regardless of the database. If it doesn't exist I guess it's possible to just parse fasta files, but I'd be surprised if no one else had done it before!

Cheers

genome ucsc • 42k views
ADD COMMENT
1
Entering edit mode

A quick google search would help you. For e.x

http://www.danielecook.com/generate-fasta-sequence-lengths/

ADD REPLY
0
Entering edit mode

I did look it up on google, but thank you anyway.

ADD REPLY
24
Entering edit mode
8.2 years ago
pip install pyfaidx
faidx input.fasta -i chromsizes > sizes.genome

That's what you're looking for if you want to use bedtools genomecov, but you can transform to a BED file as well using -i bed.

ADD COMMENT
0
Entering edit mode

Looks perfect! Thank you.

ADD REPLY
26
Entering edit mode
8.2 years ago
rioualen ▴ 710

Just found out faidx is available with samtools so another possibility is:

samtools faidx input.fa
cut -f1,2 input.fa.fai > sizes.genome
ADD COMMENT
1
Entering edit mode

ALSo can be in this format at once:

samtools faidx genome.fa | cut -f1,2 > chromsizes

ADD REPLY
0
Entering edit mode

Tried this. Didn't work because samtools faidx genome.fa outputs to genome.fa.fai. Therefore, the command by rioualen should be the one used.

Unless maybe you use some type of flags in the command, maybe you could use pipes to run it as a one-line command as Apex92 tried.

ADD REPLY
0
Entering edit mode

This is the only line that worked for me, except the line only produced a output.fa.fai file which was identical to the chromsizes file i was looking except for extra columns (ie cut -f1,2 didnt work in this line)

samtools faidx input.fa | cut -f1,2 > chromsizes

However after getting that .fa.fai file i was able to cut out just the columns i needed (chromosome name & size) to produce the .sizes files i needed (also correction to the above the cut command needs to be given an input and output files).

cut -f1,2 Hel_final_2016_new.fa.fai > Hel_final_2016_new.fa.sizes

As a result you can run this line twice to make it work (since the first time it runs it will create the .fai file that the second use of the line will require):

samtools faidx Hel_final_2016_new.fa | cut -f1,2 Hel_final_2016_new.fa.fai > Hel_final_2016_new.fa.sizes

ADD REPLY
0
Entering edit mode

samtools faidx does not work for this purpose.

ADD REPLY
2
Entering edit mode
5.0 years ago
gbdias ▴ 150
  • Another good option is to use the faSize command from UCSC tools. It works on any fasta file, not just UCSC genomes.

  • You run it like this: faSize -detailed -tab file.fasta

  • Default behavior is to print to STDOUT, so you can redirect the output to a file like this: faSize -detailed -tab file.fasta > output.txt

  • The output will be a tab delimited file with sequence name on the first column and the length on the second.

  • You can easily install faSize and other tools from UCSC using Bioconda https://anaconda.org/bioconda/ucsc-fasize

ADD COMMENT
0
Entering edit mode
5.6 years ago
aleferna ▴ 10

samtools was crashing because I included the HLA's,

HLA-A*01:01:38L 3374

 

Traceback (most recent call last):
  File "/usr/local/bin/faidx", line 11, in <module>
    load_entry_point('pyfaidx==0.5.2', 'console_scripts', 'faidx')()
  File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 197, in main
    write_sequence(args)
  File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 50, in write_sequence
    outfile.write(transform_sequence(args, fasta, name, start, end))
  File "/usr/local/lib/python2.7/dist-packages/pyfaidx/cli.py", line 120, in transform_sequence
    line_len = fasta.faidx.index[name].lenc
KeyError: 'HLA-A*01'

Finally just did:

from Bio import SeqIO
for rec in SeqIO.parse("hg38-Mix.fa","fasta"):
     print rec.id+"\t"+str(len(rec.seq))
ADD COMMENT
0
Entering edit mode

faidx was failing because the cli script expects UCSC-style regions encoded as contig:start-end, where start-end is optional. Since the HLA alt contains contain : characters you ran into a parsing issue. If you want to use pyfaidx in the same way as biopython you can do:

from pyfaidx import Fasta
for rec in Fasta("hg38-Mix.fa"):
  printrec.name + str(len(rec)))

This will be much faster if you have a lot of sequences since none of the pyfaidx operations require reading the sequence unless you start slicing strings.

ADD REPLY

Login before adding your answer.

Traffic: 2108 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6