Extraction of first sequences from a big fasta file
6
4
Entering edit mode
9.2 years ago
vahapel ▴ 210

Dear All,

I would like to ask a question regarding extraction of 100000 sequences in a big fasta file. In the forum, there is a bunch of script handling the sequence extraction based on ID number, but I could not find a script for such a purpose. Basically, is there any script or bash command for extraction first and/or last 100000 sequences in a fasta file?

Many thanks in advance for all your help!

sequence • 16k views
ADD COMMENT
0
Entering edit mode

If not, you could write one easily enough with biopython or bioperl...

Well, the first X records is easier than the last X records, but still.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

That is not an answer for the question that was originally asked.

ADD REPLY
6
Entering edit mode
9.2 years ago

This strategy is based on standard nix tools. Get the *first two sequences:

awk -v RS='>' 'NR>1 { gsub("\n", ";", $0); sub(";$", "", $0); print ">"$0 }' seq.fa \
    | head -n 2 \
    | tr ',' '\n'

It assumes the semicolon ; doesn't occur in the sequence names.

Replace head with tail to get the last sequences.

ADD COMMENT
0
Entering edit mode

Hi, dariober thank you so much for this script. It works well

ADD REPLY
4
Entering edit mode
9.2 years ago

For the first 100k sequences:

reformat.sh in=data.fasta out=100k.fasta reads=100000

You can also get a random 100k sequences with Reformat, but not the last 100k.

ADD COMMENT
0
Entering edit mode

Hi Brian, thank you for your help and introducing "bbmap" tools for me, it makes my project more easy

ADD REPLY
0
Entering edit mode

You're welcome!

ADD REPLY
2
Entering edit mode
9.2 years ago
5heikki 11k

Assuming no linebreaks in sequences, i.e every record is exactly two lines. I believe for fastq files the multiplier would be 4:

head -n 200000 input > output
ADD COMMENT
0
Entering edit mode

thank you, 5heikki for your help, it is very simple and useful command.

ADD REPLY
0
Entering edit mode

likewise, use tail for the last N sequences.

ADD REPLY
2
Entering edit mode
4.9 years ago
AK ★ 2.2k

Alternatives: seqkit head and seqkit range:

(1) Leading 100000 records:

seqkit head -n 100000 input.fa
seqkit range -r 1:100000 input.fa

(2) Last 100000 records:

seqkit range -r -100000:-1 input.fa

(3) Other ranges:

seqkit range -r 100001:200000 input.fa
ADD COMMENT
0
Entering edit mode
6.5 years ago
REQUESTED_LINES=10
awk "/^>/ {n++} n>$REQUESTED_LINES {exit} {print}" input.fasta > output.fasta
ADD COMMENT
0
Entering edit mode
4.9 years ago
psschlogl ▴ 50
hdl = gzip.open(file, 'rt')
records = SeqIO.parse(hdl, 'fastq')
first_read = next(records)
printfirst_read.id)
ADD COMMENT
1
Entering edit mode

Hi psschlogl

Though your answer shows some potential it does not really answer the question posted. If you can provide a more complete/correct answer we could keep it in the answers thread. If not, we consider removing it.

ADD REPLY

Login before adding your answer.

Traffic: 2004 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6