Passing a file to sed to remove from file A all lines that match file B, plus the line after each match
5.3 years ago
sdbaney ▴ 10

Hi, I have a FASTA file that contains sequences from a de novo assembly. We have identified rRNA sequences that we are now trying to remove from that FASTA file. I can remove lines with sed but have to do it per line in a script and I have about 500 sequences that I need to remove. Is there a way that I can write this to take the matching sequences from file B (the rRNA sequences) and remove them and the line following (the actual sequence in the FASTA file) from file A? I have tried grep and comm but grep gives me a byte error and comm didn't make any difference to my files.

Any guidance would be greatly appreciated.

sed grep comm

A safer way, using seqkit:

seqkit grep -v -f <fileb> input.fasta


I get it to run successfully and it prints the result to the screen but the result doesn't have anything from fileb removed from it. It's exactly the same as the input. Here is my code:

$ seqkit grep -v -f ~/Desktop/kettinAlignment-wwST-strict-3\ of\ 10hits.fasta ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta
5.3 years ago

You can try an fgrep approach, which works fairly well:

fgrep -v -f file_b.fa file_a.fa > filtered.fa

It keeps returning "illegal byte sequence"
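
A note on this error: "illegal byte sequence" is typically what the BSD grep and sed shipped with macOS report when the input contains bytes that are not valid in the current locale's encoding, for example when one of the files is not plain text. A common workaround is to run the command under the C locale so the input is compared as raw bytes. The sketch below uses small made-up files (demo_a.fa, demo_b.fa) and adds -x for whole-line matching, which was not part of the original command.

```shell
# Hypothetical demo files; substitute your own FASTA files.
printf '>seq1\nACGTACGT\n>seq2\nGGCCGGCC\n>seq3\nTTGATTGA\n' > demo_a.fa
printf '>seq2\nGGCCGGCC\n' > demo_b.fa

# LC_ALL=C makes grep compare raw bytes instead of decoding characters.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
LC_ALL=C grep -F -x -v -f demo_b.fa demo_a.fa > demo_filtered.fa

cat demo_filtered.fa
```

This only removes the right lines when the sequences are wrapped identically in both files and no sequence line from file B also occurs in a record you want to keep.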

5.3 years ago

Use seqkit grep:

$ cat input.fasta | seqkit grep -v -f list > new.fa

fin swimmer


Hi! So far this one has gotten me the furthest. I get an output file, but it is the same as my input file. It hasn't removed any of the sequences that I list in the second file. Here is my code; do you notice anything incorrect?

cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta | seqkit grep -v -f ~/Desktop/kettinAlignment-wwST-strict-3\ of\ 10hits.fasta > out3.fasta

Hello,

The file following -f has to contain just the IDs of the sequences you would like to remove. If you only have a FASTA file, you can extract these IDs with:

$ grep "^>" ~/Desktop/kettinAlignment-wwST-strict-3\ of\ 10hits.fasta | sed 's/^>//' > fastaids.txt

fin swimmer
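
One caveat with the grep/sed pipeline above: it keeps the whole header line. As far as I know, seqkit grep matches on the sequence ID by default, i.e. the header text up to the first whitespace, so headers that carry a description would not match. A minimal sed-only sketch that keeps just the ID, using a hypothetical demo_headers.fa:

```shell
# Hypothetical demo file; the second header carries a description.
printf '>TRINITY_DN1_c0_g1\nACGT\n>TRINITY_DN2_c0_g1 len=4 path=[0:0-3]\nGGCC\n' > demo_headers.fa

# For header lines only, keep the text between ">" and the first whitespace.
# -n suppresses default output; the trailing p prints lines where the
# substitution matched.
sed -n 's/^>\([^[:space:]]*\).*/\1/p' demo_headers.fa > demo_ids.txt

cat demo_ids.txt
```

For headers without a description, as in the TRINITY example in this thread, both approaches give the same list.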


Oh okay! I can easily just have the IDs. Should it be a .txt file?

I performed the following and still returned a fasta file with all of the IDs, none taken out.

$ cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta | seqkit grep -v -f ~/Desktop/remove.txt > new3.fasta

The text file has the IDs listed. I tried it once including the > and once without, to see if that was what was throwing it off, but I still get the same result.


Please show the first few lines of kettinAlignment-wwST-strict-10hits.fasta and remove.txt.

$ cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta 
>TRINITY_DN25900_c0_g2
CCGTTCTTTTGTACTTGTTATAATCTTTGAAGAAATCTGAGTTTGTTCATCCAGTGAGTG
AACAAGCTAAGATTTCTTCAAAGATTATAACAAGTACAAAAGAACGTAAAGAGGTTGTGT
CTGAAGAAACTAAGATTCAAATTGGAAAATATTAGTTTTTGCTTACTAGAAAAATGAATA
AATGTATGAACATTCATTTACAGTTTCAACAATGATGGTTATGCAGAAAGATTGGATAGT
TGGTAGTCTTTATGATCATGTGTTATCTATTGCCATTGTTCATCTCAAAATATTGATGAA
ATGCATCCAGGCCACTCCCCACTATTCATAGCATGTTTCCCTATTTCCTTCCCTATCTGT
GGAACCATATAAAAAGATAGTTCCACAATCAGAAGAAGTACACCTGAAATTAGCCAGTAC
ATCTGTTGTTCCTACAAAAGAAACTACAGTTGTTATTAGTGAAGAACACAAACCTGAAGA
GAAAGTATCAGTTGTTGTAGCAGAGTCACAAGTTGTGTCTGAAGAAAAGTGTTTGAAGAA
GTTCAATTTGAATATACAGCTGTTGCAACAAATGAATGTGGAAAAGTTACAACTTCAGCA
TACATCACAATTCTAGATCAAAGATGTTCCTTCACAAATGAAAATTAATATTGAATCTAA
ACAAGATTTCTCCAGAAAAAGCAATTGAACTTAAAAAGACAGAGAAAGTAGTTAAAAGAA
$ cat ~/Desktop/remove.txt 
{\rtf1\ansi\ansicpg1252\cocoartf1561\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 >TRINITY_DN25900_c0_g2\

Hmmm... this is weird... When I open the txt file it just lists

>TRINITY_DN25900_c0_g2
>TRINITY_DN24782_c0_g2

I created a text file via the command line through nano and when I run it I still get my input back out. No sequences removed.

Here is the command

$ cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta | seqkit grep -v -f ~/remove.txt > newfasta1.fasta

Here is the text file:

$ cat remove.txt 
>TRINITY_DN25900_c0_g2
>TRINITY_DN24782_c0_g1
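
The first cat output above starts with {\rtf1, which means that copy of remove.txt was saved as rich text (RTF) rather than plain text, so the formatting codes end up in the ID list (the cocoartf marker suggests a macOS editor such as TextEdit). A quick check before passing a file to -f, as a sketch with hypothetical file names and a hypothetical helper is_rtf:

```shell
# Hypothetical demo files: one saved as RTF, one as plain text.
printf '{\\rtf1\\ansi\n\\f0\\fs24 >TRINITY_DN25900_c0_g2\\\n}\n' > demo_rich.txt
printf 'TRINITY_DN25900_c0_g2\n' > demo_plain.txt

# is_rtf is a helper defined here for illustration only:
# RTF documents begin with the literal characters {\rtf
is_rtf() {
  [ "$(head -c 5 "$1")" = '{\rtf' ]
}

for f in demo_rich.txt demo_plain.txt; do
  if is_rtf "$f"; then
    echo "$f: RTF, re-save as plain text"
  else
    echo "$f: plain text"
  fi
done
```

On macOS, TextEdit's Format > Make Plain Text, or textutil -convert txt, should produce a plain-text version; creating the file with nano, as done above, avoids the problem altogether.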

Remove the > at the start of each line in remove.txt.
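
If seqkit is not at hand, the same filtering can be sketched with sed and awk. This is an illustration with made-up file names and records, not what seqkit does internally: strip the leading > from the ID list, then drop every record whose ID is listed, including all of its wrapped sequence lines.

```shell
# Hypothetical demo input: three records, two with wrapped sequences.
printf '>seq1\nACGT\nACGT\n>seq2 rRNA\nGGCC\nGGCC\n>seq3\nTTGA\n' > demo_input.fa
# ID list as in the thread, still carrying the leading ">".
printf '>seq2\n' > demo_remove.txt

# Strip the leading ">" so each line is a bare ID.
sed 's/^>//' demo_remove.txt > demo_remove_ids.txt

# First file: remember every ID. Second file: on each header, decide
# whether the record is listed; print lines only while it is not.
awk 'NR == FNR { drop[$1] = 1; next }
     /^>/      { skip = (substr($1, 2) in drop) }
     !skip' demo_remove_ids.txt demo_input.fa > demo_kept.fa

cat demo_kept.fa
```

Note that an empty ID list would make awk treat the FASTA file itself as the ID list and print nothing, so check that the list is not empty first.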


Just use seqkit seq -n -i seqs.fa to retrieve IDs.

5.3 years ago
sdbaney ▴ 10

I wanted to give an update: I was able to accomplish this by following this post:

faSomeRecords
