Question: Fast way to extract specific sequences from large fasta
mnsp088 • 10 · 4 months ago

Hi all!

I have ~2k text files, each with ~1k protein names (one protein name per line), and I need to extract the sequences of these proteins from a large master fasta file containing ~5.5 million sequences. I wrote the code below to do this and it works, but it is taking a really long time to process even one text file.

for file in `find text_files | grep .txt`;
do
    echo $file
    # strip the directory part, keeping only the file name
    file2=$(echo $file | sed -re 's/^.+\///')
    # one samtools faidx call per protein name
    cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa
done

Any ideas on better ways to go about this, particularly to speed things up?

Thank you for your suggestions.


You should show examples of the protein names from the name-list files and from the fasta headers.

Also, did you edit the find text_files | grep .txt part? I don't think this command would find anything at all.

Anyway, I think

samtools faidx master_fasta.fs $(cut -c 1- ${file}) > ./fastas/$file2.fa

would be faster than

cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa

because the first form calls samtools once per list file, while xargs -n 1 starts a separate samtools process for every sequence name.

edit: you can also do

samtools faidx -r $file master_fasta.fs > ./fastas/$file2.fa
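
Applied to the whole batch, that turns the original loop into one samtools call per list file. A minimal sketch, assuming each .txt file holds one sequence name per line and the names match the fasta headers exactly:

for file in text_files/*.txt; do
    # -r reads all names from the list file, so samtools runs once per file
    base=$(basename "$file" .txt)
    samtools faidx -r "$file" master_fasta.fs > "./fastas/${base}.fa"
done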
h.mon 25k · 4 months ago

The faSomeRecords utility from Jim Kent at UCSC should be a fast way to do this.

Additional options in: Extract reads from fasta file (specific read_names) and make a new fasta file
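If the lists are plain text with one header per line, a faSomeRecords call could look like the sketch below; protein_names.txt and selected.fa are placeholder names, and the argument order (input fasta, name list, output fasta) should be double-checked by running faSomeRecords without arguments:

# extract only the sequences whose names appear in protein_names.txt
faSomeRecords master_fasta.fs protein_names.txt selected.fa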

genomax 68k · 4 months ago
