Question

How to extract the last 1000 nt from a group of sequences in a FASTA file?

0

Entering edit mode

5.2 years ago

Andrei ▴ 10

Hi everyone!

I would like to know how I can extract the 1000 last nucleotides from sequence in a FASTA file and their headers using linux command line.

There are about 2000 sequence in this FASTA file and need to extract the last 1000 nucleotides from each sequence keeping their original headers.

I am a beginner at bioinformatics and I am still learning how to manipulate FASTA files, and I'm stuck with this problem and I can not solve it.

Thanks!

sequence • 4.5k views

ADD COMMENT • link updated 5.2 years ago by Pierre Lindenbaum 161k • written 5.2 years ago by Andrei ▴ 10

2

Entering edit mode

seqkit subseq supports this, if you want a fast solution.

seqkit subseq -r -1000:-1 seqs.fa > result.fa

If you want to learn programming, write some Python scripts using Biopython.

ADD REPLY • link 5.2 years ago by shenwei356 8.4k

1

Entering edit mode

See this post, which uses awk to do some similar: https://stackoverflow.com/questions/30264809/trim-first-n-bases-in-multi-fasta-file-with-awk-and-print-with-max-width-format

The command in the last code block should be somewhat easy to modify. Instead of using the left_trim argument to trim the first 1000 bases as in the example, you could modify the script to keep the last 1000 bases (something like replacing start=left_trim+1; for strat=length(rec) - left_trim + 1). Try it out and test on some examples.

ADD REPLY • link 5.2 years ago by manuel.belmadani ★ 1.3k

1

Entering edit mode

I could do it! I'm very thankful for your help.

ADD REPLY • link 5.2 years ago by Andrei ▴ 10

0

Entering edit mode

Could you please paste here the explanation that you did about sed syntax?

ADD REPLY • link 5.2 years ago by Andrei ▴ 10

1

Entering edit mode

I removed it because it didn't handle sequences that spanned multiple lines, which is common for FASTA files. sed works line by line, so awk was more appropriate in that use case. But the command was cat file.txt | sed 's|^[^>].*\(.{1000}\)|\1|g, if you're intrested. Briefly, sed 's|PATTERN|REPLACEMENT|g' will match lines that have PATTERN and substitute it with REPLACEMENT. The pattern is a regular expression ^[^>].*\(.{1000}\), which means from the a line that begins (^) with anything except a > (^[^>]) followed by anything of unlimited length .* until you can't match anything but 1000 characters of anything (.{1000}, and see greedy matching which applies here.). The last part is wrapped in parenthesis (\(.{1000}\), or (GROUP)) where GROUP is a part of the regex you want to keep for later. The REPLACEMENT part just says "replace with group 1 " (\1). The backslashes before { } ( ) are there to say you don't want to match those characters, instead they are part of the regular expression syntax. I'd highly recommend learning more about sed and regexes for bioinformatics, they're really useful and widely used.

ADD REPLY • link 5.2 years ago by manuel.belmadani ★ 1.3k

score 1 · Answer 1 · 2019-02-08

1

Entering edit mode

5.2 years ago

5heikki 11k

This requires Biopython. Save it as extractEnds.py and chmod +x extractEnds.py to make it executable and then just run it

#!/usr/bin/python

import sys
from sys import argv
from Bio import SeqIO

if (len(sys.argv) < 2):
    print("Usage: extractEnds.py file.fna")
    sys.exit()
else:
    FNAME = argv[1]
    IFILE = open(FNAME, "rU")

for record in SeqIO.parse(IFILE, "fasta"):
    print(">%s\n%s" % (record.description, record.seq[-1000:]))

IFILE.close()

I'm quite a Python noob so if there's a way to make this better or there are mistakes, please let me know..

ADD COMMENT • link 5.2 years ago by 5heikki 11k

2

Entering edit mode

No need for open(FNAME, "rU") or IFILE.close(). SeqIO handles all opening and closing itself.

You can modify this simply to:

for record in SeqIO.parse(sys.argv[2], "fasta"):

ADD REPLY • link 5.2 years ago by Joe 21k

1

Entering edit mode

Thanks. What's the difference between (argv[1], ..) and (sys.argv[1], ..)?

edit. nevermind, found it, it just depends on what gets imported in the start..

edit2. For years I sticked with epic GNU coreutil oneliners, but I'm starting to see the advantages of biopython. Especially if you write scripts that someone else inherits some day.. they will be much happier if you used biopython

ADD REPLY • link 5.2 years ago by 5heikki 11k

3

Entering edit mode

If you want it to be robust, or you want the code to be used by anyone but yourself (especially if they're a novice), a dedicated parser like BioPython which covers lots and lots of edge cases is worth its weight in gold.

You could write that code in a reasonably concise one-liner too (this one also has the benefit of taking a second CLI arg for any length of sequence):

python3 -c "import sys; from Bio import SeqIO; [print(f'>{r.id}\n{r.seq[-int(sys.argv[2]):]}') for r in SeqIO.parse(sys.argv[1],'fasta')] ;" seqs.fa 1000

(Python 3 only because of a print-inside-list-comprehension and use of fstrings.

ADD REPLY • link 5.2 years ago by Joe 21k

score 1 · Answer 2 · 2019-02-08

1

Entering edit mode

5.2 years ago

Pierre Lindenbaum 161k

linearize and use a second awk to extract the last 100 characters.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fasta |\
awk -F '\t' '{x=100;L=length($2);printf("%s\n%s\n",$1,(L<=x?$2:substr($2,1+L-x,x)));}'

ADD COMMENT • link 5.2 years ago by Pierre Lindenbaum 161k