Question: Extraction of first sequences from a big fasta file
3 months ago
vahapel • 170
Turkey

Dear All,
I would like to ask a question regarding the extraction of 100,000 sequences from a big FASTA file. In the forum there are plenty of scripts handling sequence extraction based on ID, but I could not find one for this purpose. Basically, is there any script or bash command for extracting the first and/or last 100,000 sequences of a FASTA file?

Many thanks in advance for all your help!

4.6 years ago vahapel • 170 • updated 3 months ago SMK ♦ 1.3k

If not, you could write one easily enough with Biopython or BioPerl...

Well, the first X records are easier to get than the last X records, but still.

4.6 years ago
Devon Ryan
90k

You can have a look at this; it may be helpful for you:

Extract sequence with header from a fasta file with specific ID given in another file

3.3 years ago
Tanvir Ahamed
• 270

That is not an answer to the question that was originally asked.

3.3 years ago
genomax
68k
3 months ago
dariober 10k
WCIP | Glasgow | UK

This strategy is based on standard *nix tools. Get the first two sequences:

awk -v RS='>' 'NR>1 { gsub("\n", ";", $0); sub(";$", "", $0); print ">"$0 }' seq.fa \
    | head -n 2 \
    | tr ';' '\n'

It assumes the semicolon (;) occurs nowhere in the sequence headers or sequences.

Replace head with tail to get the last sequences.
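Following the note above, a parameterized last-N version could look like the sketch below (the toy file `seq.fa` and `N=2` are just for illustration; in the original question N would be 100000):

```shell
# Toy input: 3 records, one of them wrapped across two lines.
printf '>a\nACGT\n>b\nGG\nCC\n>c\nTT\n' > seq.fa

N=2
# Flatten each record to one line, take the last N, then unflatten.
# Assumes ';' occurs nowhere in the headers or sequences.
awk -v RS='>' 'NR>1 { gsub("\n", ";"); sub(";$", ""); print ">"$0 }' seq.fa \
    | tail -n "$N" \
    | tr ';' '\n' > last.fa
```

Because each record is one line after flattening, `tail -n "$N"` selects exactly the last N sequences regardless of line wrapping in the input.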

4.6 years ago dariober 10k • updated 3 months ago RamRS 21k

Hi dariober, thank you so much for this script. It works well.

4.6 years ago
vahapel
• 170 • updated 3 months ago
RamRS
21k
3 months ago
Brian Bushnell 16k
Walnut Creek, USA

For the first 100k sequences:

reformat.sh in=data.fasta out=100k.fasta reads=100000

You can also get a random 100k sequences with Reformat, but not the last 100k.

4.6 years ago Brian Bushnell 16k • updated 3 months ago RamRS 21k

Hi Brian, thank you for your help and for introducing the BBMap tools to me; they make my project much easier.

4.6 years ago
vahapel
• 170

You're welcome!

4.6 years ago
Brian Bushnell
16k
3 months ago
5heikki 8.4k
Finland

This assumes no line breaks within sequences, i.e., every record is exactly two lines. For FASTQ files the multiplier would be 4:

head -n 200000 input > output
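The 200000 above is just 2 × 100000. In general, for the first N sequences you take 2×N lines (4×N for four-line FASTQ); a minimal sketch, with a toy two-line-per-record file and N=2 standing in for 100000:

```shell
# Toy FASTA where every record is exactly two lines.
printf '>a\nAC\n>b\nGG\n>c\nTT\n' > input.fa

N=2
head -n "$((2 * N))" input.fa > output.fa   # FASTA: 2 lines per record
# For FASTQ, the multiplier is 4: head -n "$((4 * N))" input.fq > output.fq
```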
4.6 years ago 5heikki 8.4k • updated 3 months ago RamRS 21k

Thank you, 5heikki, for your help; it is a very simple and useful command.

4.6 years ago
vahapel
• 170

Likewise, use tail for the last N sequences.

23 months ago
st.ph.n
♦ 2.5k
3 months ago
SMK ♦ 1.3k
Ghent, Belgium

Alternatives: seqkit head and seqkit range:

(1) Leading 100000 records:

seqkit head -n 100000 input.fa
seqkit range -r 1:100000 input.fa

(2) Last 100000 records:

seqkit range -r -100000:-1 input.fa

(3) Other ranges:

seqkit range -r 100001:200000 input.fa
3 months ago SMK ♦ 1.3k
3 months ago
Simply Bioinformatics • 150
WashingtonDC
REQUESTED_SEQS=10
awk -v n="$REQUESTED_SEQS" '/^>/ { c++ } c > n { exit } { print }' input.fasta > output.fasta
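Because the counter increments on header lines rather than counting lines, this approach also handles wrapped (multi-line) sequences; a quick demonstration on a hypothetical 3-record file, taking the first 2:

```shell
# Toy input: 3 records, the first one wrapped over two lines.
printf '>a\nAC\nGT\n>b\nGG\n>c\nTT\n' > input.fasta

# Stop reading as soon as the (n+1)-th header is seen.
awk -v n=2 '/^>/ { c++ } c > n { exit } { print }' input.fasta > output.fasta
```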
23 months ago Simply Bioinformatics • 150 • updated 3 months ago RamRS 21k
3 months ago
psschlogl • 0
import gzip
from Bio import SeqIO

hdl = gzip.open(file, 'rt')          # 'file' is the gzipped input path
records = SeqIO.parse(hdl, 'fastq')
first_read = next(records)           # first record only
print(first_read.id)
3 months ago psschlogl • 0

Hi psschlogl,

Though your answer shows some potential, it does not really answer the question posted. If you can provide a more complete/correct answer, we can keep it in the answers thread. If not, we may consider removing it.

3 months ago
lieven.sterck
5.1k
