Obtain sequences that did not appear in your fasta file
2
0
Entering edit mode
5.8 years ago

Good day, I am trying to identify which sequences did not appear from my list. I have file1.txt which is a list of genes

#gene1
#gene2
#gene3
#gene4
#gene5

I also have file2.fa

#>gene1
ACTAGA
#>gene3
ACATGA
#>gene6
AGATA

I want to be able to identify the genes that are not found in file2.fa based on file1.txt list sample output would be

#gene2
#gene4
#gene5

I tried for i in $(cat file1.txt); do perl -ne '/$i/ && print' file2.fa > output.txt; done it gives everything that appeared in the list. I tried diffirent iterations to get whats not on the list but I wasn't able to. Hope someone could help me with this. Thanks!

genome ngs perl unix bash • 1.4k views
ADD COMMENT
0
Entering edit mode

I assume those # are not in your real data since that will break the fasta format.

ADD REPLY
0
Entering edit mode

yup sorry. It was showing as something else in my laptop before I added the #

ADD REPLY
1
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. You would not need to add # in that case.
code_formatting

Thank you!

ADD REPLY
0
Entering edit mode

Thank you very much genomax. Will do that next time.

ADD REPLY
2
Entering edit mode
5.8 years ago
comm -23 <(sort file1.txt ) <(grep '>' file2.fa | cut -c 2- | sort)
ADD COMMENT
0
Entering edit mode

thank you very much. follow up question. What if the header in the fasta file has >gene3 <additional information="">. they were skipped. What would be the easiest way to include them?

ADD REPLY
1
Entering edit mode

What would be the easiest way to include them?

insert a cut command before sort

ADD REPLY
1
Entering edit mode

Thanks for the help! PS. Big fan of your one liner scripts btw.

ADD REPLY
1
Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

Upvote|Bookmark|Accept

ADD REPLY
0
Entering edit mode

Thanks for the reminder.

ADD REPLY
2
Entering edit mode
5.8 years ago

output:

$ grep \> file2.fa | sed 's/>//g' | grep -vf - file1.txt 
gene2
gene4
gene5

input:

 $ cat file2.fa 
>gene1
ACTAGA
>gene3
ACATGA
>gene6
AGATA

$ cat file1.txt 
gene1
gene2
gene3
gene4
gene5
ADD COMMENT
1
Entering edit mode

using comm,sed and grep:

$ grep \> file2.fa| sed 's/>//g'  | sort | comm -13 - <(sort file1.txt)

gene2
gene4
gene5
ADD REPLY
0
Entering edit mode

Thanks for the help. This one worked. The first one had an error message about invalid range end. Thanks!!!

ADD REPLY

Login before adding your answer.

Traffic: 2509 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6