Frequency of a unique sequence in a fasta file
4
0
Entering edit mode
9.2 years ago

Hi,

I have a FASTA file containing sequences with id as header . The header for all sequence are unique but some sequences are similar to each other. (all.fasta)

To get a FASTA file of unique sequences, I used gt-sequniq command (unique.fasta)

Now I want to find which unique sequence occurs how many times including the header name of similar sequences (Frequency of each unique sequence from all.fasta file).

Thank you

sequencing FASTA • 8.0k views
ADD COMMENT
4
Entering edit mode
9.2 years ago
5heikki 11k

So, you're looking for a clustering algorithm. I would recommend either Usearch or cd-hit..

ADD COMMENT
0
Entering edit mode

I think it's not a clustering algorithm !! its more about frequency distribution of each unique sequence.

ADD REPLY
0
Entering edit mode

i.e. cluster size at 100% identity

ADD REPLY
0
Entering edit mode

This is a MUCH better option.

ADD REPLY
3
Entering edit mode
9.2 years ago

My solution:

Main file: all.fasta

Unique sequence file, generated from main file: unique.fasta

gt sequniq -o unique.fasta all.fasta

R function to find the similar sequence

funiq<-function(all.fa,uniq.fa)
        {
        all<-unlist(all.fa)
        sequ.names <- names(uniq.fa)
        sequ <- NULL
        sequ[sequ.names] <- list(NULL)
        for(i in 1: length(uniq.fa))
            {
            sequ[[i]]<-names(all[which(all%in%(uniq.fa[[i]][1]))])
            }
        return(sequ)
        }

Read FASTA file in R

library(seqinr)
all<-read.fasta("all.fasta",as.string = TRUE,seqtype="AA")
uniq<-read.fasta("unique.fasta",as.string = TRUE,seqtype="AA")

Name of similar sequences:

nam<- funiq(all,uniq)

Result:

$ADC37925
[1] "ADC37925" "EYO75956" "EVE92773"

$AFR73793
[1] "AFR73793" "EPZ11191" "EPZ09632" "EQM92926" "EOR47863" "CAQ50233"

$EFB95474
[1] "EFB95474"

Frequency count:

fcount<-lapply(nam,length)

Result:

$ADC37925
[1] 3

$AFR73793
[1] 6

$EFB95474
[1] 1
ADD COMMENT
0
Entering edit mode

Well, that was rude! Anyway, glad you solved the problem.

ADD REPLY
0
Entering edit mode

Thanks for your help!!

ADD REPLY
1
Entering edit mode

You're welcome, but just so you know, adding your own answer as the accepted answer after people have given you suggestions and answers is rude.

ADD REPLY
0
Entering edit mode

Thanks for your suggestion.

Definitely I highly expect and prefer peoples suggestions and answers. That's why we share our problem and thinking.

But the the meaning of "Accepted " answer is most relevant solution of the problem/question.

is it ok to accept suggestion as solution or not to share my own solution here?

ADD REPLY
0
Entering edit mode

Your interpretation of "Accepted" is one of many possible interpretations, but the point is, as long as answers such as 5heikki's give you an optimal solution to your problem, and have been added in a relevant time frame, adding your own answer and accepting only your answer is equivalent to hosting a party, holding a contest and kinda rigging it to win it yourself. It is a crude anomaly, but you get the gist. It is not bad, just rude.

Consider accepting all answers that get you from your question to the desired answer and do not need much effort (in this case, 5heikki's). This is good for people that may have the same problem in the future, so they have choices when it comes to solutions.

ADD REPLY
0
Entering edit mode

How to export the Name of similar sequences (nam) in csv file?

ADD REPLY
3
Entering edit mode
9.2 years ago

Step 1: Linearize your FASTA file

$ awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {print}' yourfastafile.fasta > yourfastafile_linearized.fasta

Step 2: Dereplicate your FASTA file and keep an account of unique sequences

$ grep -v "^>" yourfastafile_linearized.fasta | 
  grep -v [^ACGTacgt] | \
  sort -d | \
  uniq -c | \
  while read abundance sequence
  do
    hash=$(printf "${sequence}" | sha1sum);
    hash=${hash:0:40};
    printf ">%s;size=%d;\n%s\n" "${hash}" "${abundance}" "${sequence}";
  done >  yourfastafile_linearized_dereplicated.fasta

The output will be in USEARCH format with size giving the frequency of unique reads:

$ head yourfastafile_linearized_dereplicated.fasta
>dff2e3c82439e27949c51a5e45acae06b4a2cafe;size=1;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAAACTGACTATTCCACTGGAACCACTGAACGGAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAGGCGGTTCCTGAATGAATGGGAAGCCTTCAAGAAGGTGATAAGCAGGAGAAACATACGAAGGCGCATAACGATACCACTGACCCTCAGCAATCTTAAACTTCTTAGACGAATCACCAGAACGGAAAACAGCCTTCATAGAAATTTCACGCGGCGG
>a6387c43c45ee62484eafb6c3301a8c0ed0fd546;size=30;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAACCTGACTATTCCACTGCAACAACTGAACGGACTGGAAACACTGGTCATAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAG
>66a3448cf6f606e3f843ea5f81d5a77d461bbf5e;size=13;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAACCTGACTATTCCACTGCAACAACTGAACGGACTGGAAACACTGGTCATAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAGGCGGTTCCTGAATGAATGGGAAGCCTTCA
>0746a116007d044ac40d6ead5d8fa029e034c9a6;size=1;

Best Wishes,
Umer

ADD COMMENT
0
Entering edit mode

Thanks.

But I am more interested about the members (name or ID) of each group.

ADD REPLY
0
Entering edit mode

Step #2 doesn't work

ADD REPLY
0
Entering edit mode

When you say something doesn't work, it always helps to give details. A sample of your input, the exact command you ran, and any errors you faced, plus, most importantly, why you think it doesn't "work". Remember, it may seem like it doesn't work while it works perfectly fine.

ADD REPLY
0
Entering edit mode
9.2 years ago
Ram 43k

I wrote a script that deals with the problem that gt-sequniq deals with. You can change it a bit to suit your case: https://github.com/RamRS/myPerlScripts/blob/master/mulFaDeDup.pl

Modify the existing hash to hold sequence:comma-separated list of headers as key:value

Then, add a new hash with header sequence as key and a counter as value, increase counter to number of times the sequence is found. Filter the final hash by value > 1 and write the keys out with their values map the remaining keys back to the previous hash.

This way, you can retain header info as well as get a count.

EDIT; Updated with a HUGE bug fix.

ADD COMMENT

Login before adding your answer.

Traffic: 2473 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6