Biostar Beta. Not for public use.
Failure in alignment with BWA using fusion reference genome
0
Entering edit mode
2.2 years ago
@marongiu.luigi22078

Dear all,

I have used the fasta genome provided by NCBI as reported in this post . The headers of this file are:

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
>chr2  AC:CM000664.2  gi:568336022  LN:242193529  rl:Chromosome  M5:f98db672eb0993dcfdabafe2a882905c  AS:GRCh38
>chr3  AC:CM000665.2  gi:568336021  LN:198295559  rl:Chromosome  M5:76635a41ea913a405ded820447d067b0  AS:GRCh38
[...]
>chrUn_GL000218v1  AC:GL000218.1  gi:224183305  LN:161147  rl:unplaced  M5:1d708b54644c26c7e01c2dad5426d38c  AS:GRCh38
>chrEBV  AC:AJ507799.2  gi:86261677  LN:171823  rl:decoy  M5:6743bd63b3ff2b5b8985d8933c53290a  SP:Human_herpesvirus_4  tp:circular


I need to use a fusion genome built by concatenating this human genome with one obtained from selected virus sequences. This virus genome is formed by a single header and a long stretch of nucleotides derived from individual virus sequences.

Following the indications given in this other post I prepared the header for the virus genome as follows:

>chrV  AC:XXXXXXXX.1  gi:00000000  LN:370064105  rl:Chromosome   M5:5aa5be7025d7baa666a8651e0909e4ce  AS:1  SP:All_viruses  tp:linear


I made up accession number AC to XXXXXXXX.1 because there is no real entry for my made-up genome in Genbank & NCBI; since the IDs given in the human genome are 8 digit long, I gave a 8 letters fake entry and a ".1" because this is first time I am using this genome (maybe I should have used two letter, 6 numbers?).

Same for the GI number: the made up genome is not recorded in GenBank, thus I simply gave a fake 8 digit number.

LN is the length of the genome, I treated it as a real chromosome and M5 derives for md5sum I made on the fasta file. AS and SP are free text fields (I assumed) and the genome is linear.

By the way, I separated the fields with two spaces.

I concatenated the human genome and the made up virus genome with cat <human.fa> <virus.fa> > <fusion.fa> and I prepared the indices for this genome and aligned the samples with

bwa index -a bwtsw <fusion.fa>
bwa mem -t 8 -R <read_group> <fusion.fa> <R1.fq.gz> <R2.fq.gz> | \
samtools sort -o <file_ALN-SRT.sam>


However, I got this error message:

[bns_restore_core] Parse error reading <fusion.fa>.amb


and the SAM file is virtually empty:

cat <file_ALN-SRT.sam>
@HD VN:1.3  SO:coordinate


May I ask what I got wrong? When aligning against either one or the other genome separately the alignment is OK, thus it must be a problem with the headers I guess.

Thank you

alignment software error • 384 views
0
Entering edit mode

Does this help?

0
Entering edit mode

That would be weird because I could align to the individual genomes individually. How could spaces be introduced by simply concatenating two files? And that post indicates that "spaces in the header are OK". Anyway, I'll try this, thank you.

0
Entering edit mode

Nope, it did not. I tried with both sed -i 's/\s*$//g' (for spaces in the sequence) and sed -i 's/^[^>]\s*$//g' (for spaces in the header) followed by bwa index but the result was always the sae.

0
Entering edit mode

I also tried with printf "field\t...field\n" but I got the same error.

1
Entering edit mode

Let's check if the line ending terminators are correct.

$file <fusion.fa> <fusion.fa>: ASCII text  If the output is <fusion.fa>: ASCII text, with CRLF line terminators you should install dos2unix and run dos2unix <fusion.fa>. fin swimmer ADD REPLYlink 0 Entering edit mode That was a good tip that I shall remember, but there no Windows here: $ file GRCh38.fa
GRCh38.fa: ASCII text
$file virChr10510.fa virChr10510.fa: ASCII text$ file fusion38-10k.fa
fusion38-10k.fa: ASCII text

0
Entering edit mode

Could you please try to reindex fusion38-10k.fa. But this time without any parameters.

For -a bwtsw you used I see a warning in the manual about it:

Warning: -a bwtsw' does not work for short genomes, while -a is' and
-a div' do not work not for long genomes.


Without setting the algorithm bwa will auto select the best one.

fin swimmer

0
Entering edit mode

exactly the same error...

0
Entering edit mode
• Can you index and align against <human.fa> only?
• Can you index and align against <virus.fa> only?
• What's the output of grep "chrV" -A 5 -B 5 <fusion.fa>?
0
Entering edit mode
• yes
• yes
• grep "chrV" -A 5 -B 5 fusion38-10k.fa GGGGGCGGCGCGGGAGCCTGCACGCCGTTGGAGGGTAGAATGACAGGGGGCGGGGACAGAGAGGCGGTCG CGCCCCCGGCCGCGCCAGCCAAGCCCCCAAGGGGGGCGGGGAGCGGGCAATGGAGCGTGACGAAGGGCCC CAGGGCTGACCCCGGCAAACGTGACCCGGGGCTCCGGGGTGACCCAGCCAAGCGTGACCAAGGGGCCCGT GGGTGACACAGGCAACCCTGACAAAGGCCCCCCAGGAAAGACCCCCGGGGGGCATCGGGGGGGGTGTTGG CGGGGGCATGGGGGGGTCGGATTTCGCCCTTATTGCCCTGTTT >chrV AC:KI270741.1 gi:86261678 LN:370064105 rl:Chromosome M5:5aa5be7025d7baa666a8651e0909e4ce AS:GRCh38 SP:All_viruses AACATGGATTTTGATAAAATTTATAAAGATTTAGTAATGGATATTATTAAGAACGGAGTAACAGAATTACCAGAAGGGACTCGTGCAGTATATGCTGATG GGACTCCTGCAACAACTAAGTTTATTGAAGGTGTAAACTTTACAATCACACCAGATATGGGGGCTCCTTTATTACGTAGTAAGCATGTAGGTATTAAATG GGCGCTAACGGAATTGCAGTGGATTTGGCAAGAAATGTCTAATGAGGTTTCGTGGTTAAAAGAACGAGGTGTAACTATTTGGGACGAATGGGAGCAAGAA GATGGCACTATTGGAAAAGCGTATGGTTACGCATTATTTAACAAAGAACGTTTTATTCCTGTTTATAGTGGAGATGTTGAGTTAGCTAAAAATCGCAAAG CAGGTTCACTTATTCCAAAAGTCGTACCTAAAAAAGGTGGTTGGATAGCAGGAACACATCAACCATACCTAAAGCTAAATCAGGTAGAGGCAGTTATTGA  ADD REPLYlink 0 Entering edit mode Hm, so the problem arises only in the concatenated file. The position in <fusion.fa> where the file were concatenated looks fine to me. Which version of bwaare you using? Have you done anything else after cat <human.fa> <virus.fa> > <fusion.fa> other than bwa index <fusion.fa>? fin swimmer ADD REPLYlink 0 Entering edit mode BWA is Version: 0.7.17-r1188. I did nothing else. You might notice that I changed the fields of chrV to match those of the human chromosome, but no improvements. Although unlikely, could the word length of the sequences (excluding the headers) affect the alignment? The human genomes has 80 characters per line and the virus has 100. ADD REPLYlink 4 Entering edit mode 2.1 years ago @finswimmer37605 You might notice that I changed the fields of chrV to match those of the human chromosome, but no improvements. Everything after the first whitespace is just a description and have no influence on anything. bwa, samtools and other just ignore that part. Although unlikely, could the word length of the sequences (excluding the headers) affect the alignment? The human genomes has 80 characters per line and the virus has 100. I also thought about that, even if this would be weird. But at least in my testing here, this have no influence. According to your answers here the problem arises only in the concatenated file. Let's check if there are any non valid characters within the sequences. Does this little awk script produce any output:  awk '/[^a-zA-Z]/ && $0 !~"^>" {print "invalid char at line ", NR}' <fusion.fa>  fin swimmer ADD COMMENTlink 0 Entering edit mode Ah, that is very clever! Yes it did find some troubles: $ awk '/[^a-zA-Z]/ && $0 !~"^>" {print "invalid char at line ", NR}' fusion38-10k.fa invalid char at line 44558934 invalid char at line 46471885$ sed '44558934q;d' fusion38-10k.fa
TGCTTAGATGTAAGAGATAAACATTTAAAAGTGGAGTGAGCTAACCTCAATTGCAAGAGTCAAGCCCTGGAGCGCCAACTCCCAGAAGGGGGCGAACCAA
TTGAGCACTCTCACACACACCAAAAAGATTTTCCCTTTTCTTTGTACAGGGTTGGTTTCACCTCAACACCATAAAAGATCCGGAAACGGTGAAGGGCTTG
$sed '46471885q;d' fusion38-10k.fa GTGTTCAAAATC@AACGCTAAGTGTCGCCCCGAACGACGTAGGGAAGCTAAAGAAAATAAGACACCTGGGTATCTTCTTGAACTATGTCGTAAACGAATG  looks like there is a newline or a gap in the former and the "@" character I don't know how it came out in the first place. Both are present in the file I prepared: $ grep -n "^>" fusion38-10k.fa
1:>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
...
44282437:>chrEBV  AC:AJ507799.2  gi:86261677  LN:171823  rl:decoy  M5:6743bd63b3ff2b5b8985d8933c53290a  SP:Human_herpesvirus_4  tp:circular
44284893:>chrV  AC:KI270741.1  gi:86261678  LN:370064105  rl:Chromosome   M5:5aa5be7025d7baa666a8651e0909e4ce  AS:GRCh38  SP:All_viruses


I amended the "@" with:

$sed 's/@//' fusion38-10k.fa > fusion38-10k2.fa$ grep @ fusion38-10k2.fa


I tried to solve the former problem with:

sed '44558934s/[[:space:]]*/\n/' fusion38-10k.fa > fusion38-10k2.fa


but the illegal character was not a space/tab; by using a text editor I found that it was some kind of unformatted value. So I removed it manually from the text editor. Checked the quality of the file with the awk command, re-cat it and re-index it. I am testing the alignment now, I will update on the outcome. Thanks!

0
Entering edit mode

The alignment has started! So the problem was indeed about those faulty characters. Case closed. Thank you so much again.

1
Entering edit mode

Fine if we got it. I moved my comment to an answer, so you can mark this thread as solved.

I would suggest you investigate a bit more, what goes wrong on your system when using cat`. Not that your system is broken in someway and you get more problems because of this unexpected behaviour.

fin swimmer

0
Entering edit mode

I tried to investigate but none of the steps introduced funny characters. It will remain a mistery...