Question

Obtain sequences that did not appear in your fasta file

5.8 years ago

sicat.paolo20 ▴ 30

Good day. I am trying to identify which genes from my list do not appear in my FASTA file. I have file1.txt, which is a list of genes:

#gene1
#gene2
#gene3
#gene4
#gene5

I also have file2.fa

#>gene1
ACTAGA
#>gene3
ACATGA
#>gene6
AGATA

I want to identify the genes listed in file1.txt that are not found in file2.fa. Sample output would be:

#gene2
#gene4
#gene5

I tried for i in $(cat file1.txt); do perl -ne '/$i/ && print' file2.fa > output.txt; done, but it gives everything that appeared in the list. I tried different iterations to get what's not on the list, but I wasn't able to. I hope someone can help me with this. Thanks!

genome ngs perl unix bash • 1.4k views

ADD COMMENT • link updated 5.8 years ago by Pierre Lindenbaum 161k • written 5.8 years ago by sicat.paolo20 ▴ 30
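A note on why the loop in the question prints everything: inside single quotes the shell does not expand $i, so Perl most likely sees its own unset variable, the pattern is empty, and every line matches. The redirect inside the loop also overwrites output.txt on each pass. Below is a minimal sketch of a loop that does report the missing genes. It assumes each header is exactly ">" followed by the gene name, and the sample files are recreated in a scratch directory purely for illustration.

```shell
# Work in a scratch directory so the sample files do not clobber real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample inputs copied from the question (without the leading #).
printf '%s\n' gene1 gene2 gene3 gene4 gene5 > file1.txt
printf '%s\n' '>gene1' ACTAGA '>gene3' ACATGA '>gene6' AGATA > file2.fa

# For each gene, look for a header line that is exactly ">gene".
# -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string.
# The gene is printed only when grep finds no such header.
while IFS= read -r gene; do
    grep -q -x -F ">$gene" file2.fa || printf '%s\n' "$gene"
done < file1.txt > output.txt

cat output.txt
```

The redirect sits after done, so output.txt is written once for the whole loop. This rescans file2.fa once per gene, so for long lists a sort-based approach is likely faster.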

I assume those # are not in your real data since that will break the fasta format.

ADD REPLY • link 5.8 years ago by GenoMax 141k

Yup, sorry. It was showing as something else on my laptop before I added the #.

ADD REPLY • link 5.8 years ago by sicat.paolo20 ▴ 30

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. You would not need to add # in that case.

Thank you!

ADD REPLY • link 5.8 years ago by GenoMax 141k

Thank you very much genomax. Will do that next time.

ADD REPLY • link 5.8 years ago by sicat.paolo20 ▴ 30

score 2 · Accepted Answer · 2018-06-25

5.8 years ago

Pierre Lindenbaum 161k

comm -23 <(sort file1.txt ) <(grep '>' file2.fa | cut -c 2- | sort)

ADD COMMENT • link 5.8 years ago by Pierre Lindenbaum 161k
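For readers unfamiliar with comm, here is a small sketch of what the -23 flag is doing. comm compares two sorted inputs and prints three columns: lines only in the first, lines only in the second, and lines in both. Each digit in the flag suppresses one column. The file names list.sorted and fasta_ids.sorted are made up for this illustration, and the IDs are the ones from the question.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Both inputs must be sorted the same way for comm to work.
printf '%s\n' gene1 gene2 gene3 gene4 gene5 | sort > list.sorted
printf '%s\n' gene1 gene3 gene6 | sort > fasta_ids.sorted

# No flags: column 1 = only in the list, column 2 = only in the FASTA IDs,
# column 3 = present in both (columns are separated by tabs).
comm list.sorted fasta_ids.sorted

# Suppress columns 2 and 3: genes in the list with no FASTA record.
only_in_list=$(comm -23 list.sorted fasta_ids.sorted)

# Suppress columns 1 and 3: FASTA records that are not in the list.
only_in_fasta=$(comm -13 list.sorted fasta_ids.sorted)

# Suppress columns 1 and 2: genes found in both.
in_both=$(comm -12 list.sorted fasta_ids.sorted)

printf 'missing from FASTA:\n%s\n' "$only_in_list"
```

Swapping the order of the two inputs swaps columns 1 and 2, which is why a command that puts the FASTA IDs first needs -13 instead of -23 to get the same answer.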

Thank you very much. Follow-up question: what if a header in the fasta file looks like >gene3 <additional information>? Those were skipped. What would be the easiest way to include them?

ADD REPLY • link 5.8 years ago by sicat.paolo20 ▴ 30

What would be the easiest way to include them?

insert a cut command before sort

ADD REPLY • link 5.8 years ago by Pierre Lindenbaum 161k
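One possible reading of that suggestion, as a sketch: add a second cut that keeps only the first space-delimited field of each header before sorting. This assumes the extra information is separated from the gene name by a single space (a tab-delimited header would need a different -d). The description text and the intermediate file fasta_ids.txt are invented for the example, and the pipeline is written with a pipe and "-" so it also runs under plain sh. In bash the same extra cut can go inside the <( ... ) of the accepted answer.

```shell
# Work in a scratch directory so the sample files do not clobber real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample inputs modelled on the thread; the text after the space is made up.
printf '%s\n' gene1 gene2 gene3 gene4 gene5 > file1.txt
printf '%s\n' '>gene1' ACTAGA '>gene3 some description' ACATGA '>gene6' AGATA > file2.fa

# Header IDs: keep header lines, drop the leading ">",
# then keep only the first space-delimited field, and sort.
grep '>' file2.fa | cut -c 2- | cut -d ' ' -f 1 | sort > fasta_ids.txt

# Lines unique to the sorted gene list, i.e. genes with no FASTA record.
missing=$(sort file1.txt | comm -23 - fasta_ids.txt)
printf '%s\n' "$missing"
```

Headers without a space pass through the second cut unchanged, so plain ">gene1" style headers keep working.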

Thanks for the help! PS. Big fan of your one liner scripts btw.

ADD REPLY • link 5.8 years ago by sicat.paolo20 ▴ 30

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

ADD REPLY • link 5.8 years ago by Pierre Lindenbaum 161k

Thanks for the reminder.

ADD REPLY • link 5.8 years ago by sicat.paolo20 ▴ 30

score 2 · Accepted Answer · 2018-06-25

5.8 years ago

cpad0112 21k

output:

$ grep \> file2.fa | sed 's/>//g' | grep -vf - file1.txt 
gene2
gene4
gene5

input:

$ cat file2.fa
>gene1
ACTAGA
>gene3
ACATGA
>gene6
AGATA

$ cat file1.txt 
gene1
gene2
gene3
gene4
gene5

ADD COMMENT • link 5.8 years ago by cpad0112 21k

Using comm, sed and grep:

$ grep \> file2.fa | sed 's/>//g' | sort | comm -13 - <(sort file1.txt)
gene2
gene4
gene5

ADD REPLY • link 5.8 years ago by cpad0112 21k

Thanks for the help. This one worked. The first one had an error message about invalid range end. Thanks!!!

ADD REPLY • link 5.8 years ago by sicat.paolo20 ▴ 30
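A possible explanation for the "invalid range end" message, assuming it came from the grep -vf pipeline: with -f, grep treats every header as a regular expression, so a real gene name containing brackets or similar characters can be read as a malformed pattern. Plain regex matching also has a quieter problem, since gene1 would match gene10 as a substring. The sketch below uses -F (fixed strings) and -x (whole-line match) to avoid both. The gene names are invented to trigger these cases, and fasta_ids.txt is a hypothetical intermediate file.

```shell
# Work in a scratch directory so the sample files do not clobber real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up names: "gene1" is a substring of "gene10",
# and one name contains a bracket expression that is invalid as a regex.
printf '%s\n' gene1 gene10 'gene[z-a]' gene4 > file1.txt
printf '%s\n' '>gene1' ACTAGA '>gene[z-a]' ACATGA > file2.fa

# Collect the header IDs, stripping only the leading ">".
grep '>' file2.fa | sed 's/^>//' > fasta_ids.txt

# -f: read patterns from the ID file, -F: take them literally,
# -x: require a whole-line match, -v: print the lines that do not match.
missing=$(grep -v -x -F -f fasta_ids.txt file1.txt)
printf '%s\n' "$missing"
```

With these inputs only gene10 and gene4 are reported as missing, and the bracketed name is matched literally instead of being parsed as a range.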