Question

Custom nr database with only arthropods ?

1

Entering edit mode

5.6 years ago

Picasa ▴ 640

Hi,

I would like to make a custom database (nr) for blast local running but containing only arthropods sequences.

Any way to do this ?

Thanks a lot.

nr blast arthropods • 4.5k views

ADD COMMENT • link updated 5.6 years ago by shenwei356 8.4k • written 5.6 years ago by Picasa ▴ 640

0

Entering edit mode

I think you should be able to modify the taxonomy ID in tutorials provided in the following post to suit your requirements.

Vertebrate Subset Nr Database? Build My Own?

ADD REPLY • link 5.6 years ago by Sej Modha 5.3k

0

Entering edit mode

Thanks for you answer !

But this topic is 8 years old and seems like those gi methods don't work anymore (message from genomax)

ADD REPLY • link 5.6 years ago by Picasa ▴ 640

1

Entering edit mode

Yes, you're right, I completely forgot about that. I guess you could either use -seqidlist which is equivalent of -gilist in the current version of BLAST+ 2.7 or you could use the alpha release of the newer version of BLAST+ that allows taxonomy ID based filter: https://ftp.ncbi.nlm.nih.gov/blast/db/v5/blastdbv5.pdf

Recent enhancements to BLAST+
We have made some recent enhancement to the BLAST+ applications that allow you to:
1.) Limit your search by taxonomy using information built into the BLAST databases.

ADD REPLY • link 5.6 years ago by Sej Modha 5.3k

0

Entering edit mode

Extract all protein sequences of specific taxons from the NCBI nr database, and makeblastdb.

ADD REPLY • link 5.6 years ago by shenwei356 8.4k

0

Entering edit mode

I have used your tutorial but it seems that the fasta at the end (made with seqkit) is not complete:

grep '>' nr.arthropods.fasta -c
4119698

wc -l arthropods.taxid.acc.txt

8228966 arthropods.taxid.acc.txt

I have downloaded the last prot.accession2taxid.gz and nr.gz .

ADD REPLY • link 5.6 years ago by Picasa ▴ 640

0

Entering edit mode

do you have a way to fix ?

ADD REPLY • link 5.6 years ago by Picasa ▴ 640

0

Entering edit mode

what's the taxid are you using?

ADD REPLY • link 5.6 years ago by shenwei356 8.4k

0

Entering edit mode

Using Arthropods: 6656

ADD REPLY • link 5.6 years ago by Picasa ▴ 640

0

Entering edit mode

change

seqkit grep -f virus.taxid.acc.txt nr.gz -o nr.virus.fa.gz

to

zcat nr.gz \
    | perl -e 'BEGIN{ $/ = ">"; <>; } while(<>){s/>$//;  $i = index $_, "]\n"; $h = substr($_, 0, $i)."]"; $s = substr $_, $i+2; while($h =~ /(.+?) \[(.+?)\]/g){ print ">$1 [$2]\n$s";} } ' \
    | seqkit grep -f arthropods.taxid.acc.txt -o nr.arthropods.fa.gz

Where the Perl script is for unfolding the records with same sequences in different species

$ cat t.fa 
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]EAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVA
INGWILVGRMKKSSKKAQYEDFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQ
PYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKRIEKQKNHIVIGGSSLNSLPV
SLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM

to

$ cat t.fa | perl -e 'BEGIN{ $/ = ">"; <>; } while(<>){s/>$//;  $i = index $_, "]\n"; $h = substr($_, 0, $i)."]"; $s = substr $_, $i+2; while($h =~ /(.+?) \[(.+?)\]/g){ print ">$1 [$2]\n$s";} } '
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVA
INGWILVGRMKKSSKKAQYEDFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQ
PYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKRIEKQKNHIVIGGSSLNSLPV
SLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM
>EAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVA
INGWILVGRMKKSSKKAQYEDFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQ
PYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKRIEKQKNHIVIGGSSLNSLPV
SLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM

ADD REPLY • link 5.6 years ago by shenwei356 8.4k

0

Entering edit mode

done.

But I still got low number of sequence compared to the accession number:

grep '>' nr.arthropods.seqkit2.fasta -c
4133470

is it normal ?

ADD REPLY • link 5.6 years ago by Picasa ▴ 640

0

Entering edit mode

Where the Perl script is for unfolding the records with same sequences in different species

Possibly because shenwei356 is discarding records with same sequence.

ADD REPLY • link 5.6 years ago by GenoMax 141k

0

Entering edit mode

Ok.

Anyway. I have tried to make a database with this new fasta, with this command:

makeblastdb -in ../nr.arthropods.fasta -parse_seqids -dbtype prot -out nr.arthropods

and I got this error:

Building a new DB, current time: 09/10/2018 21:29:34
New DB name:   db_blast/nr.arthropods
New DB title:  ../nr.arthropods.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
FASTA-Reader: Ignoring invalid residues at position(s): On line 279052: 3-9, 20-21, 31, 44, 46-48, 51-55, 59-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279053: 1-5, 16-17, 27, 52-53, 57-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279054: 1-3, 14-15, 25, 45-46, 50-56
FASTA-Reader: Ignoring invalid residues at position(s): On line 279055: 6-7, 17, 42-43, 47-53
FASTA-Reader: Ignoring invalid residues at position(s): On line 279056: 3-4, 14, 33, 50-52, 56-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279057: 1-2, 12-13, 23, 42, 59-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279058: 1, 5-11, 22-23, 33, 54-55, 59-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279059: 1-5, 23-24, 34, 46, 52-57
FASTA-Reader: Ignoring invalid residues at position(s): On line 279060: 1-7, 17-18, 28, 50-51, 55-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279061: 1, 11-12, 22, 44-45, 49-55
FASTA-Reader: Ignoring invalid residues at position(s): On line 279062: 5-6, 16, 38-39, 43-49, 59-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279063: 9, 31-32, 36-42, 52-53
FASTA-Reader: Ignoring invalid residues at position(s): On line 279064: 4, 26-27, 31-37, 47-48, 58
FASTA-Reader: Ignoring invalid residues at position(s): On line 279065: 12, 19-21, 26-27, 31-37, 48-49, 59
FASTA-Reader: Ignoring invalid residues at position(s): On line 279066: 17-18, 22-28, 39-40, 50
FASTA-Reader: Ignoring invalid residues at position(s): On line 279067: 8-9, 13-19, 30-31, 41
FASTA-Reader: Ignoring invalid residues at position(s): On line 279068: 3-4, 8-14, 25-26, 36, 51-52, 56-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279069: 1-2, 13-14, 24, 36, 38, 43-48, 52-58
FASTA-Reader: Ignoring invalid residues at position(s): On line 279070: 8, 10, 20, 37-38, 42-48, 59-60
FASTA-Reader: Ignoring invalid residues at position(s): On line 279071: 9, 30-31, 35-41, 50

volume: db_blast/nr.arthropods

file: db_blast/nr.arthropods.pin
file: db_blast/nr.arthropods.phr
file: db_blast/nr.arthropods.psq

BLAST Database creation error: Input doesn't start with a defline or comment around line 279072

by looking at this line, I have something indeed a bit weird:

sed -n '279072p' < nr.arthropods.fasta 
[Alpheus gracilipes]AFN55284.1 histone H3, partial [Delecto

ADD REPLY • link 5.6 years ago by Picasa ▴ 640

0

Entering edit mode

I'll check it after a full download of nr.gz. And have to go to work now.

ADD REPLY • link 5.6 years ago by shenwei356 8.4k

0

Entering edit mode

yes so apparently, some sequences don't start with '>'

ADD REPLY • link 5.6 years ago by Picasa ▴ 640

0

Entering edit mode

It's the bug of the perl one-liner, because of the not well formated nr.gz.

I gave up after few hours. May be we should blast on whole nr and filter the result.

ADD REPLY • link 5.6 years ago by shenwei356 8.4k

score 3 · Answer 1 · 2018-09-12

Data:

pre-formated blastdb (09/10/2018)
prot.accession2taxid.gz (09/07/2018) (optional, but recommended)

Hardware in this tutorial

CPU: AMD 8-cores/16-threads 3.7Ghz
RAM: 32GB
DISK:
- Taxonomy files stores in NVMe SSD
- blatdb files stores in 7200rpm HDD

Tools:

blast+
pigz (recommended, faster than gzip)
taxonkit
seqkit (recommended)

Steps:

Listing all taxids below $id using taxonkit.

id=6656

# 6656 is the phylum Arthropoda
# echo 6656 | taxonkit lineage | taxonkit reformat 
# 6656    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda    Eukaryota;Arthropoda;;;;;

# time: 3s 
taxonkit list --ids $id --indent "" > $id.taxid.txt 

wc -l $id.taxid.txt
# 518373 6656.taxid.txt

Retrieving target accessions. There are two options:

From prot.accession2taxid.gz (faster, recommended). Note that some accessions are not in nr.

# time: 4min
pigz -dc prot.accession2taxid.gz \
    | csvtk grep -t -f taxid -P $id.taxid.txt \
    | csvtk cut -t -f accession.version \
    | sed 1d \
    > $id.acc.txt


wc -l $id.acc.txt
# 8174609 6656.acc.txt

From pre-formated nr blastdb

# time: 40min
blastdbcmd -db nr -entry all -outfmt "%a %T" | pigz -c > nr.acc2taxid.txt.gz

pigz -dc nr.acc2taxid.txt.gz | wc -l
# 555220892

# time: 3min
pigz -dc nr.acc2taxid.txt.gz \
    | csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \
    | cut -d ' '  -f 1 \
    > $id.acc.txt


wc -l $id.acc.txt
# 6928021 6656.acc.txt

Retrieving FASTA sequences from pre-formated blastdb. There are two options:

From nr.fa exported from pre-formated blastdb (faster, smaller output file, recommended). DO NOT directly download nr.gz from ncbi ftp, in which the FASTA headers are not well formated.

# time: 117min
blastdbcmd -db nr -dbtype prot -entry all -outfmt "%f" -out - | pigz -c > nr.fa.gz

# time: 80min
# perl one-liner is used to unfold records having mulitple accessions
cat <(echo) <(pigz -dc nr.fa.gz) \
    | perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//;  $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } ' \
    | seqkit grep --delete-matched -f $id.acc.txt -o nr.$id.fa.gz

# counting sequences
# 
# ls -lh nr.$id.fa.gz
# -rw-r--r-- 1 shenwei shenwei 902M 9月  13 01:42 nr.6656.fa.gz
# 
pigz -dc nr.$id.fa.gz | grep '^>' -c

# 6928017
# Here 6928017 ~=  6928021 ($id.acc.txt)

Directly from pre-formated blastdb

# time: 5h20min
blastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz


# counting sequences
#
# Note that the headers of outputed fasta by blastdbcmd are "folded"
# for accessions from different species with same sequences, so the 
# number may be small than $(wc -l $id.acc.txt).
pigz -dc nr.$id.fa.gz | grep '^>' -c
# 1577383

# counting accessions
#  
# ls -lh nr.$id.fa.gz
# -rw-r--r-- 1 shenwei shenwei 2.1G 9月  13 03:38 nr.6656.fa.gz
#
pigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\n>/g' | grep '^>' -c
# 288415413

makeblastdb

pigz -dc nr.$id.fa.gz > nr.$id.fa

# time: 3min ($nr.$id.fa from step 3 option 1)
#
# building $nr.$id.fa from step 3 option 2 with -parse_seqids would produce error:
#
#     BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1
#
makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id

# rm nr.$id.fa

GenoMax · Answer 2 · 2018-09-11

Do not follow this answer.

~~Here's the whole commands (replacing pigz with gzip if pigz not installed).~~

id=6656 # Step 1) list all taxid below $id taxonkit list --ids $id --indent "" > $id.taxid.txt # Step 2) get all accessions # 3 min for me pigz -d -c prot.accession2taxid.gz \ | csvtk -t grep -f taxid -P $id.taxid.txt \ | csvtk -t cut -f accession.version \ > $id.taxid.acc.txt # Step 3) Option A (very slow) blastdbcmd -db nr -entry all -outfmt "%a\t%T" | \ csvtk -t grep -f 2 -P $id.taxid.acc.txt | \ csvtk -t cut -f 1 | \ blastdbcmd -db nr -entry_batch - -out nr.$id.fa # Step 3) Option B (relative fast but buggy) # reformat nr.gz and retrieve sequences (buggy) # 16 min for me cat <(echo) <(pigz -d -c nr.gz) \ | perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//; $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ /\[.+\]/) { print ">$_"; next; } while($h =~ /(.+? \[.+?\])/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } '\ | seqkit grep --delete-matched -f $id.taxid.acc.txt -o nr.$id.fa.gz # checking sequences amount (with Option B) seqkit stats nr.$id.fa.gz # file format type num_seqs sum_len min_len avg_len max_len # nr.6656.fa.gz FASTA Protein 6,913,200 2,860,805,704 6 413.8 28,516 wc -l $id.taxid.acc.txt # 8174610 $id.taxid.acc.txt # Step 4) makeblastdb gzip -d -c nr.$id.fa.gz > nr.$id.fa makeblastdb -in nr.$id.fa -dbtype prot -out nr.$id

Well, as for 6,913,200 < 8,174,610:

Since many genes in different species have same sequences, they are recorded as one record in nr.gz.

The intuitional way is matching FASTA record heads with all target accesions using regular expression matching, however, it's very slow for such a big dataset:

$ pigz -d -c nr.gz | seqkit stats | tee stats.txt file format type num_seqs sum_len min_len avg_len max_len - FASTA Protein 169,199,080 61,790,996,688 6 365.2 74,488

So a better solution is "unfolding" these FASTA records.
But the FASTA headers in nr.gz are not well formated, it's hard to rightly unfold all records shared by muliple species. Here are some cases:

Normal. accession product [species]accession2 product2 [species2] (easy to solve)

>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]EAL68957.1 hypothetical protein DDB_G0276911 [Dictyosteli um discoideum AX4] Mxxxxxxxx

Some FASTA headers contain 0x01 (shown as ^ below) in front of accession (easy to solve)

$ pigz -d -c nr.gz | grep '^>' | grep CAH60339.1 | cat -A >CAH60339.1 histone h3, partial [Copelatus sp. 028TA588]^AABG79397.1 histone H3, partial [Lumbrineris magnidentata]^AABG79414.1 histone H3, partial [Polygordius lacteus]^AABX89916.1 histone 3, partial [Akalyptoischion atrichos] xxxx >^accesion product xxxxxx

Mix 1. Some headers contains species information in [], but thelast one does not. (there are ways)

>xxxxx [xxxx]aaaa bbbbbb

Mix 2. Where B3N7H1.1 does not end with ], so the last character 4 and the later accession A1Z7U3.1 lay together -_-! (HARD to solve)

>NP_001027396.1 uncharacterized protein Dmel_CG33774 [Drosophila melanogaster]XP_001970317.1 uncharacterized protein Dere_GG10558 [Drosophila erecta]B3N7H1.1 RecName: Full=Dolichyl-diphosphooligosaccharide--protein glycosyltransferase subunit 4A1Z7U3.1 RecName: Full=Dolichyl-diphosphooligosaccharide--protein glycosyltransferase subunit 4AAZ52820.1 uncharacterized protein Dmel_CG33774 [Drosophila melanogaster]EDV59376.1 uncharacterized protein Dere_GG10558 [Drosophila erecta]