Biostar Beta. Not for public use.
Error with samtools faidx...Different line length in sequence
0
Entering edit mode
2.7 years ago
oars • 150
@oars41179

I've tried making a "fai" file using the below command:

samtools faidx genome.fa

But I get an error:

[E::fai_build_core] Different line length in sequence '(null)'
Could not build fai index /media/dcpattie/backup/Week08/genome.fa.fai

I found this old thread: Error while doing indexing of fasta file using SAMTOOL faidx

Is the best course of action the following sequence of steps?

java -jar picard.jar NormalizeFasta \
      I=input_genome.fa \
      O=normalized_genome.fa
samtools faidx • 2.3k views
ADD COMMENTlink
1
Entering edit mode

Are there any blank lines in your genome.fa file?

ADD REPLYlink
0
Entering edit mode

I'm not sure, its a 3GB file. I pulled the file from the UCSC website, then after making my genome.fa output file I used...

bwa index genome.fa

...to create five additional files, all named with the prefix genome.fa (.sa; .ann; .amb; .pac; .bwt).

Now I'm trying to make a genome.fa.fai file but this is much harder than advertised.

ADD REPLYlink
1
Entering edit mode

if the following command shows different length of lines.

awk '/^>/ {print;next;} {print length($0);}'  genome.fa  | uniq

then you'll have to reformat your fasta sequence with NormalizeFasta

ADD REPLYlink
0
Entering edit mode

First - many thanks for your reply!

now, the results...

30
50
21
>chr2
50
23
>chr3
50
30
>chr4
50
26
>chr5
50
10
>chr6
50
17
>chr7
50
13
>chr8
50
22
>chr9
50
31
>chr10
50
47
>chr11
50
16
>chr12
50
45
>chr13
50
28
>chr14
50
40
>chr15
50
42
>chr16
50
3
>chr17
50
10
>chr18
50
48
>chr19
50
33
>chr20
50
20
>chr21
50
45
>chr22
50
16
>chrX
50
10
>chrY
50
16
ADD REPLYlink
0
Entering edit mode

It seems your >chr1 header is corrupted, here is what I get:

>chr1
50
22

So the genome is formatted with 50 bases per line, with variations in the last line for each chromosome - the differences between my results and yours for these last lines is probably due to different versions of the genome.

What is the output of head -n2 genome.fa?

ADD REPLYlink
0
Entering edit mode
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ADD REPLYlink
1
Entering edit mode

So why the output of awk '/^>/ {print;next;} {print length($0);}' genome.fa | uniq is:

30
50
21

It should be:

>chr1
50
21
ADD REPLYlink
0
Entering edit mode

oars : As an aside, if your genome.fa is corrupt then your bwa indexes are no good either.

ADD REPLYlink
0
Entering edit mode

I don't know what went wrong, here is my commands:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
$ tar xvfz chromFa.tar.gz
$ for x in {1..22} X Y;do cat chr$x.fa;done > genome.fa
ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.3