Question

Passing a file to sed to remove all lines that match +1 in file B from file A

0

5.3 years ago

sdbaney ▴ 10

Hi, I have a FASTA file that contains sequences from a de novo assembly. We have identified rRNA sequences that we are now trying to remove from that FASTA file. I can remove lines with sed, but I have to do it line by line in a script, and I have about 500 sequences to remove. Is there a way to take the matching headers from file B (the rRNA sequences) and remove each of them, along with the line that follows (the actual sequence), from file A? I have tried grep and comm, but grep gives me a byte error and comm didn't make any difference to my files.

Any guidance would be greatly appreciated.

sed grep comm • 1.9k views

1

A safer way, using seqkit:

seqkit grep -v -f <fileb> input.fasta

ADD REPLY • link 5.3 years ago by cpad0112 21k

0

It runs successfully and prints the result to the screen, but nothing from fileb has been removed; the output is exactly the same as the input. Here is my command:

$ seqkit grep -v -f ~/Desktop/kettinAlignment-wwST-strict-3\ of\ 10hits.fasta ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta

ADD REPLY • link 5.3 years ago by sdbaney ▴ 10

score 1 · Answer 1 · 2019-01-07

1

5.3 years ago

Noam Teyssier ▴ 80

You can try an fgrep approach, which works fairly well:

fgrep -v -f file_b.fa file_a.fa  > filtered.fa
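Because fgrep works line by line, it only drops lines whose text literally appears in file_b.fa, which can be unpredictable when sequences are wrapped over several lines. A record-aware filter is one alternative. The following is a minimal sketch with awk, assuming file_b.fa is non-empty, holds the records to remove, and that the ID is the first word of each header; the demo files are invented for illustration:

```shell
# Demo inputs (hypothetical data, for illustration only).
printf '>seq1\nACGT\nACGT\n>seq2 some description\nGGCC\n>seq3\nTTAA\n' > file_a.fa
printf '>seq2\nGGCC\n' > file_b.fa

# First pass (file_b.fa): remember the ID of every header line.
# Second pass (file_a.fa): at each header decide whether to keep the record,
# then print or skip every line until the next header.
awk '
    NR == FNR {
        if (/^>/) { sub(/^>/, "", $1); drop[$1] = 1 }
        next
    }
    /^>/ { id = $1; sub(/^>/, "", id); keep = !(id in drop) }
    keep
' file_b.fa file_a.fa > filtered.fa

cat filtered.fa
```

Matching on the whole ID rather than on substrings also avoids removing seq10 when only seq1 is listed.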

0

It keeps returning "illegal byte sequence".

ADD REPLY • link 5.3 years ago by sdbaney ▴ 10

1

https://stackoverflow.com/a/19770395/8767800 may help out.
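One common workaround for "illegal byte sequence" errors from grep or sed on macOS is to force the C locale for that single command, so the tool treats its input as raw bytes instead of rejecting bytes that are invalid in UTF-8. A sketch, assuming that is the cause here; the demo files are invented, and grep -F is the modern spelling of fgrep:

```shell
# Demo inputs (hypothetical); file_a.fa gets a stray non-UTF-8 byte (octal 351)
# in one header to mimic a file that trips up locale-aware tools.
printf '>seq1 caf\351\nACGT\n>seq2\nGGCC\n' > file_a.fa
printf '>seq2\nGGCC\n' > file_b.fa

# LC_ALL=C applies to this one command only.
# -a asks grep to treat the input as text even if it contains odd bytes.
LC_ALL=C grep -a -F -v -f file_b.fa file_a.fa > filtered.fa
```

If the stray bytes come from a rich-text or word-processor export, re-saving the file as plain text is likely the cleaner fix.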

ADD REPLY • link 5.3 years ago by Noam Teyssier ▴ 80

finswimmer · Answer 2 · 2019-01-07

1

5.3 years ago

finswimmer 16k

Use seqkit grep:

$ cat input.fasta | seqkit grep -v -f list > new.fa

fin swimmer

0

Hi! So far this one has gotten me the furthest. I get an output file, but it is the same as my input file: none of the sequences listed in the second file have been removed. Here is my command; do you notice anything incorrect?

cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta | seqkit grep -v -f ~/Desktop/kettinAlignment-wwST-strict-3\ of\ 10hits.fasta > out3.fasta

ADD REPLY • link 5.3 years ago by sdbaney ▴ 10

0

Hello,

The file following -f has to contain just the IDs of the sequences you'd like to remove. If you only have a FASTA file, you can extract these IDs with:

$ grep "^>" ~/Desktop/kettinAlignment-wwST-strict-3\ of\ 10hits.fasta|sed 's/^>//' > fastaids.txt
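One caveat with the one-liner above: it keeps the whole header line. By default seqkit grep matches on the sequence ID, i.e. the header text up to the first whitespace, so headers that carry a description would need trimming before they can match. A sketch of that variant, with a made-up rrna.fasta standing in for the real file:

```shell
# Demo input (hypothetical); the second header carries a description.
printf '>TRINITY_DN1_c0_g1\nACGT\n>TRINITY_DN2_c0_g1 len=4 path=[0:0-3]\nGGCC\n' > rrna.fasta

# For each header line, print only the first whitespace-separated field
# with the leading ">" removed.
awk '/^>/ { sub(/^>/, "", $1); print $1 }' rrna.fasta > fastaids.txt

cat fastaids.txt
```

For headers without descriptions, as in this thread, both versions should give the same list.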

fin swimmer

ADD REPLY • link 5.3 years ago by finswimmer 16k

0

Oh okay! I can easily just have the IDs. Should it be a .txt file?

I ran the following and still got back a FASTA file with all of the IDs; none were taken out.

$ cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta | seqkit grep -v -f ~/Desktop/remove.txt > new3.fasta

The text file just lists the IDs. I tried it once including the > and once without, to see if that was what was throwing it off, but I still get the same result.

ADD REPLY • link 5.3 years ago by sdbaney ▴ 10

0

Please show the first few lines of kettinAlignment-wwST-strict-10hits.fasta and remove.txt.

ADD REPLY • link 5.3 years ago by finswimmer 16k

0

$ cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta 
>TRINITY_DN25900_c0_g2
CCGTTCTTTTGTACTTGTTATAATCTTTGAAGAAATCTGAGTTTGTTCATCCAGTGAGTG
AACAAGCTAAGATTTCTTCAAAGATTATAACAAGTACAAAAGAACGTAAAGAGGTTGTGT
CTGAAGAAACTAAGATTCAAATTGGAAAATATTAGTTTTTGCTTACTAGAAAAATGAATA
AATGTATGAACATTCATTTACAGTTTCAACAATGATGGTTATGCAGAAAGATTGGATAGT
TGGTAGTCTTTATGATCATGTGTTATCTATTGCCATTGTTCATCTCAAAATATTGATGAA
ATGCATCCAGGCCACTCCCCACTATTCATAGCATGTTTCCCTATTTCCTTCCCTATCTGT
GGAACCATATAAAAAGATAGTTCCACAATCAGAAGAAGTACACCTGAAATTAGCCAGTAC
ATCTGTTGTTCCTACAAAAGAAACTACAGTTGTTATTAGTGAAGAACACAAACCTGAAGA
GAAAGTATCAGTTGTTGTAGCAGAGTCACAAGTTGTGTCTGAAGAAAAGTGTTTGAAGAA
GTTCAATTTGAATATACAGCTGTTGCAACAAATGAATGTGGAAAAGTTACAACTTCAGCA
TACATCACAATTCTAGATCAAAGATGTTCCTTCACAAATGAAAATTAATATTGAATCTAA
ACAAGATTTCTCCAGAAAAAGCAATTGAACTTAAAAAGACAGAGAAAGTAGTTAAAAGAA

ADD REPLY • link updated 5.3 years ago by finswimmer 16k • written 5.3 years ago by sdbaney ▴ 10

0

$ cat ~/Desktop/remove.txt 
{\rtf1\ansi\ansicpg1252\cocoartf1561\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 >TRINITY_DN25900_c0_g2\

Hmmm... this is weird... When I open the txt file it just lists:

>TRINITY_DN25900_c0_g2
>TRINITY_DN24782_c0_g2

I then created a plain text file on the command line with nano, but when I run the command I still get my input back, with no sequences removed.

Here is the command:

$ cat ~/Desktop/kettinAlignment-wwST-strict-10hits.fasta | seqkit grep -v -f ~/remove.txt > newfasta1.fasta

Here is the text file:

$ cat remove.txt 
>TRINITY_DN25900_c0_g2
>TRINITY_DN24782_c0_g1

ADD REPLY • link updated 5.3 years ago by finswimmer 16k • written 5.3 years ago by sdbaney ▴ 10

0

Remove the > at the start of each line in remove.txt.
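For a list that was typed or pasted by hand, a few cheap checks can catch the problems seen in this thread: a file saved as RTF by a rich-text editor, and leading > characters. The sketch below also strips Windows line endings, which are not mentioned in the thread but are a common cause of silent non-matches. The output name remove_ids.txt is hypothetical:

```shell
# Demo input (hypothetical): IDs with a leading ">" and a Windows line ending
# on the first line.
printf '>TRINITY_DN25900_c0_g2\r\n>TRINITY_DN24782_c0_g1\n' > remove.txt

# An RTF document starts with the literal bytes {\rtf; warn if the list is one.
if [ "$(head -c 5 remove.txt)" = '{\rtf' ]; then
    echo "remove.txt is RTF, not plain text; re-save it as plain text" >&2
fi

# Drop carriage returns, strip a leading ">", and skip empty lines.
tr -d '\r' < remove.txt | sed 's/^>//' | grep -v '^$' > remove_ids.txt

cat remove_ids.txt
```

The cleaned remove_ids.txt can then be passed to seqkit grep -v -f in place of remove.txt.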

ADD REPLY • link 5.3 years ago by finswimmer 16k

0

Just use seqkit seq -n -i seqs.fa to retrieve IDs.

ADD REPLY • link 5.3 years ago by shenwei356 8.4k

score 0 · Answer 3 · 2019-01-09

0

5.3 years ago

sdbaney ▴ 10

I wanted to give an update: I was able to accomplish this by following this post:

faSomeRecords
