Biostar Beta. Not for public use.
Extraction of first sequences from a big fasta file
2
Entering edit mode
3.5 years ago
vahapel • 170
Turkey

Dear All,
I would like to ask a question regarding extraction of 100000 sequences in a big fasta file. In the forum, there is a bunch of script handling the sequence extraction based on ID number, but I could not find a script for such a purpose. Basically, is there any script or bash command for extraction first and/or last 100000 sequences in a fasta file ?

Many thanks in advance for all your help!

sequence • 6.5k views
ADD COMMENTlink
0
Entering edit mode

If not, you could write one easily enough with biopython or bioperl...

Well, the first X records is easier than the last X records, but still.

ADD REPLYlink
0
Entering edit mode
ADD REPLYlink
0
Entering edit mode

That is not an answer for the question that was originally asked.

ADD REPLYlink
6
Entering edit mode
13 months ago
WCIP | Glasgow | UK

This strategy is based on standard nix tools. Get the *first two sequences:

awk -v RS='>' 'NR>1 { gsub("\n", ";", $0); sub(";$", "", $0); print ">"$0 }' seq.fa \
    | head -n 2 \
    | tr ',' '\n'

It assumes the semicolon ; doesn't occur in the sequence names.

Replace head with tail to get the last sequences.

ADD COMMENTlink
0
Entering edit mode

Hi, dariober thank you so much for this script. It works well

ADD REPLYlink
2
Entering edit mode
14 months ago
Walnut Creek, USA

For the first 100k sequences:

reformat.sh in=data.fasta out=100k.fasta reads=100000

You can also get a random 100k sequences with Reformat, but not the last 100k.

ADD COMMENTlink
0
Entering edit mode

Hi Brian, thank you for your help and introducing "bbmap" tools for me, it makes my project more easy

ADD REPLYlink
0
Entering edit mode

You're welcome!

ADD REPLYlink
2
Entering edit mode
12 months ago
5heikki 8.4k
Finland

Assuming no linebreaks in sequences, i.e every record is exactly two lines. I believe for fastq files the multiplier would be 4:

head -n 200000 input > output
ADD COMMENTlink
0
Entering edit mode

thank you, 5heikki for your help, it is very simple and useful command.

ADD REPLYlink
0
Entering edit mode

likewise, use tail for the last N sequences.

ADD REPLYlink
1
Entering edit mode
12 months ago
SMK ♦ 1.3k
Ghent, Belgium

Alternatives: seqkit head and seqkit range:

(1) Leading 100000 records:

seqkit head -n 100000 input.fa
seqkit range -r 1:100000 input.fa

(2) Last 100000 records:

seqkit range -r -100000:-1 input.fa

(3) Other ranges:

seqkit range -r 100001:200000 input.fa
ADD COMMENTlink
0
Entering edit mode
12 months ago
WashingtonDC
REQUESTED_LINES=10
awk "/^>/ {n++} n>$REQUESTED_LINES {exit} {print}" input.fasta > output.fasta
ADD COMMENTlink
0
Entering edit mode
12 months ago
psschlogl • 0
hdl = gzip.open(file, 'rt')
records = SeqIO.parse(hdl, 'fastq')
first_read = next(records)
printfirst_read.id)
ADD COMMENTlink
1
Entering edit mode

Hi psschlogl

Though your answer shows some potential it does not really answer the question posted. If you can provide a more complete/correct answer we could keep it in the answers thread. If not, we consider removing it.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1