Biostar Beta. Not for public use.
Extract According To Row
0
Entering edit mode
6.3 years ago
2011101101 • 100

I have a query document in fasta format. I want to extract some sequences, according to a row from another document. The two document are large.

For example. the query document is like below.enter link description here

>1
AAAAAAAAAAAAACAGTTGGCATG
>2
AAAAAAAAAAAAACCGAGTACCGTTCACGCC
>3
AAAAAAAAAAAAACCTTGAAC

The other document is like below.

motif    MZQ1    MZQ3    MZQ4    MZQ5    MZQ6    MZQ7    MZQ8    MZQ2
AAAAAAAAAAAAAGCTCGGAT    1    0    0    0    0    0    0    0
AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1
AAAAAAAAAAAAACGGGATTC    0    0    0    0    1    0    0    0
AAAAAAAAAAAAACTCAGTTCTGCCT    0    0    0    0    0    1    0    0

The expected result is the following:

motif    MZQ1    MZQ3    MZQ4    MZQ5    MZQ6    MZQ7    MZQ8    MZQ2
AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1
• 1.7k views
ADD COMMENTlink
2
Entering edit mode

It would be nice if you could try out suggested solutions and let us know which one performed best? I would be interested to see time comparison of solutions offered by Poe and Pierre Lindenbaum. Thanks

ADD REPLYlink
3
Entering edit mode
14 months ago
PoGibas 4.8k
Vilnius

head -n 1 another_document > result && grep -v '>' query_document | grep -F -f - another_document >> result

ADD COMMENTlink
0
Entering edit mode

grep -F -f - another_document

ADD REPLYlink
1
Entering edit mode

grep -F -f to extract patterns between documents - http://stackoverflow.com/a/11490467/1286528.
- is piped pattern that you want to extract (motif sequences).

ADD REPLYlink
2
Entering edit mode
15 months ago
France/Nantes/Institut du Thorax - INSE…
head -n 1 other.tsv > result
sort -t '    ' -k1,1 other.tsv | join  -t '    ' -1 1 -2 1 <(grep -v ">" file.fa | sort -u ) - >> result
ADD COMMENTlink
0
Entering edit mode

+1 I was just thinking of using Join as well, but then I saw your answer :)

ADD REPLYlink
1
Entering edit mode
16 months ago
Seattle, WA USA

Put the FASTA sequences into a hash table, and print out rows from the matrix file if the motif field element is defined in the hash table:

#!/usr/bin/env perl

use strict;
use warnings;

my $fastaFn = $ARGV[0];
my $masterFn = $ARGV[1];

my $seqsRef;
my $header;
my $sequence;

open FASTA, "< $fastaFn" or die "could not open FASTA file\n";
while (<FASTA>) {
    chomp;
    if (/>/) {
        $header = $_;
        $header =~ s/^>//;
    }
    else {
        $sequence = $_;
        $seqsRef->{$sequence} = $header;
    }
}
close FASTA;

open MASTER, "< $masterFn" or die "could not open master file for filtering\n";
my $ln = <MASTER>;
print STDOUT "$ln\n";
while (<MASTER>) {
    chomp;
    my @elems = split("\t", $_);
    my $motif = $elems[0];
    if (defined $seqsRef->{$motif}) {
        print STDOUT "$_\n";
    }
}
close MASTER;

To use it:

$ filter.pl myQuerySeqs.fa myDataMatrix.mtx > myFilteredMatrix.mtx

The file myQuerySeqs.fa is your FASTA file. The myDataMatrix.mtx file is the "master" matrix file that you want to filter on sequences from the FASTA file. Output is sent to myFilteredMatrix.mtx.

This should be fairly fast, if memory-intensive, because hash table lookups are in constant time.

ADD COMMENTlink
0
Entering edit mode

Because my document is very large,how to get the myDataMatrix.mtx?

ADD REPLYlink
0
Entering edit mode

You already have myDataMatrix.mtx (at least, if I understand your original question correctly).

ADD REPLYlink
0
Entering edit mode

Yes,I understand ,thank you

ADD REPLYlink
0
Entering edit mode

Don't forget to accept your answer when when you find the right solution for you

ADD REPLYlink
0
Entering edit mode
3.5 years ago
Whetting ♦ 1.5k
Bethesda, MD

Not the most elegant, but this should do it (unless your Fasta file is too big for memory?)

from Bio import SeqIO
tags=[]
for seq_record in SeqIO.parse("in.fas", "fasta"):
    if str(seq_record.seq) not in tags:
        tags.append(str(seq_record.seq))


for tag in tags:
    with open("2.txt","rU") as f:
        for line in f:
            line=line.rstrip()
            if tag in line:
                print line
ADD COMMENTlink
0
Entering edit mode

how to use it ?

ADD REPLYlink
0
Entering edit mode

save the file as "rows.py" run in from terminal as "python rows.py"

ADD REPLYlink
0
Entering edit mode
6.3 years ago
Agatha • 340

I am not sure how big are your files but if R can handle them, then you can use :

require("Biostrings")
sequence_data<<-read.DNAStringSet("file1.fasta")
motifs<-read.table("file2.txt",header=T)
tab3<-subset(motifs, motifs$motif%in%as.character(sequence_data))

> tab3
         motif MZQ1 MZQ3 MZQ4 MZQ5 MZQ6 MZQ7 MZQ8 MZQ2
2        AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
3 AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
4           AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1
ADD COMMENTlink
0
Entering edit mode
ADD REPLYlink
1
Entering edit mode

That is true, if sequence _data ( DNAStringSet object) would be converted to a data_ frame...

ADD REPLYlink
0
Entering edit mode
14 months ago
Nari • 870
United States
 #!/usr/bin/perl -w
print"Enter REFERENCE file: ";
chomp($file=<STDIN>);
open(FH,$file);
@org_det=<FH>;
print"Enter QUERY file: ";
$hspfile=<STDIN>;
open(FH1,$hspfile);
@hsporg=<FH1>;
print "enter output file : ";
$OUT = <STDIN>;
chomp($OUT);
open(OUT1,">$OUT");

foreach(@hsporg)
{
    @org=split('\t',$_);
    chomp($org=$_);
    foreach(@org_det)
    {
        @orginfo=split('\t',$_);
        if($org=~$orginfo[0])

        {
            print OUT1 "$_";
        }
            }
    }
close FH;
close FH1;
close OUT1;
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3