Convert Non-Redundant Back To Redundant Fasta File
4
1
Entering edit mode
10.9 years ago

I have a non-redundant FASTA file in following format:

 TCACCCATCGTACCCACTTG    1
 TTTTTGATCCTTCGATGTCGGC    64
 TCTTGAAGTAGAAAAGTTGTGGTT    2
 CGTAAGAATGTCCACAGCCAAGC    1
......

the 2nd column is the abundance of the corresponding read. I would like to have a new FASTA file containing all redundant sequences, i.e. the 1st read appears once; the 2nd appears 64 times; the 3rd one appears twice...

Could anyone help?

perl • 3.8k views
ADD COMMENT
4
Entering edit mode
ADD REPLY
4
Entering edit mode

Here's a hint using Pierre's comment as input:

$ perl -e 'print "http://whathaveyoutried.com\n\n" x 2'

http://whathaveyoutried.com

http://whathaveyoutried.com

ADD REPLY
0
Entering edit mode

I love the people here.

ADD REPLY
0
Entering edit mode

So do I. In case it's not obvious, you would replace the URL in my code above with the DNA string (in the first column of the example input) and the '2' with the appropriate number (in the second column). Perl will autosplit input from the command line (giving you the values in those columns), so you would only need make a slight modification to add a header.

ADD REPLY
8
Entering edit mode
10.9 years ago

Here is a solution using awk:

awk '{for (i=1 ; i <= $2 ; i++) {print ">seq_" NR "_" i "\n" $1}}' dereplicated_file > redundant.fasta

The output looks like that:

>seq_1_1
TCACCCATCGTACCCACTTG
>seq_2_1
TTTTTGATCCTTCGATGTCGGC
>seq_2_2
TTTTTGATCCTTCGATGTCGGC
>seq_2_3
TTTTTGATCCTTCGATGTCGGC
...
ADD COMMENT
0
Entering edit mode
10.9 years ago
Mateusz ▴ 70

Hi,

Do you know python? From technical perspective this file does not look like a FASTA file. You can simply parse it line by line spiting by "\t" and then write it to the new file given number of times.

Something like this:

with open("file.txt", 'r') as ifile:
    with open("file2.txt", 'w') as ofile:
        for line in ifile:
            line = line.strip().split("\t")
            seq = line[0]
            number = int(line[1])
            for i in range(0, number):
                 ofile.write("%s\n" % seq)

Cheers!

ADD COMMENT
0
Entering edit mode
10.9 years ago

With Biopieces www.biopieces.org) do:

read_tab -i test.tab -k SEQ,COUNT | duplicate_record -k COUNT | add_ident -k SEQ_NAME | merge_vals -k SEQ_NAME,COUNT | write_fasta -xo out.fasta
ADD COMMENT
0
Entering edit mode
10.9 years ago
Kenosis ★ 1.3k

Here's a Perl option:

perl -ane 'print ">Seq_$._$_\n$F[0]\n" for 1..$F[1]' inFile > outFile

Partial output on your dataset:

>Seq_1_1
TCACCCATCGTACCCACTTG
>Seq_2_1
TTTTTGATCCTTCGATGTCGGC
>Seq_2_2
TTTTTGATCCTTCGATGTCGGC
...
>Seq_2_63
TTTTTGATCCTTCGATGTCGGC
>Seq_2_64
TTTTTGATCCTTCGATGTCGGC
>Seq_3_1
TCTTGAAGTAGAAAAGTTGTGGTT
>Seq_3_2
TCTTGAAGTAGAAAAGTTGTGGTT
>Seq_4_1
CGTAAGAATGTCCACAGCCAAGC
ADD COMMENT

Login before adding your answer.

Traffic: 2052 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6