Question

How to merge similar transcripts together?

1

8.4 years ago

satshil.r ▴ 50

Hello,

I have 8 transcriptomes which I assembled. 4 are from a gonad and 4 are from an ovary. There are also 2 groups within each tissue. I want to create a reference transcriptome containing the contigs from all 8 of my samples, annotate the reference and use it for DGE.

I cat'd the 8 files together and got a 2 GB file with 1.8 million contigs. I then ran cd-hit-est with the following options:

cd-hit-est -i reference.fa -o reference_90.fa -c 0.9 -T 0 -M 0

The resulting file was still large, at around 1.3M contigs and 1.6 GB. How can I merge the remaining contigs further? When I annotated this 1.3M-contig FASTA, almost every gene was represented by multiple contigs, so there has to be a more concise way to merge similar transcripts together. Can someone help me out with this problem?
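Contig counts like the ones quoted above can be checked by counting FASTA header lines before and after clustering. A minimal sketch; the function name count_contigs is purely illustrative and not part of any tool:

```shell
# Count the records in a FASTA file by counting header lines,
# i.e. lines that start with '>'.
# count_contigs is a hypothetical helper name used for illustration.
count_contigs() {
    grep -c '^>' "$1"
}
```

For example, running count_contigs reference.fa and then count_contigs reference_90.fa shows how many contigs the clustering step actually removed.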

cd-hit fasta Assembly RNA-Seq • 2.5k views

ADD COMMENT • link updated 8.4 years ago by Brian Bushnell 20k • written 8.4 years ago by satshil.r ▴ 50

0

1.6Gb for a transcriptome? 1.3 million contigs? This either is not a good assembly, or you are working with some crazy organism. What kind of organism are you working on?

ADD REPLY • link 8.4 years ago by h.mon 35k

0

It's a multi-kmer approach, so it's technically multiple assemblies combined together. That's why I want to reduce the redundancy. The assembly is good.

ADD REPLY • link 8.4 years ago by satshil.r ▴ 50

Ram · Answer 1 · 2015-11-12

0

8.4 years ago

Brian Bushnell 20k

The BBMap package has a program called Dedupe that collapses exact or contained duplicate sequences. Usage:

dedupe.sh in=contigs.fa out=nodupes.fa

If you want to avoid absorbing different-length transcripts, you can use the flag minlengthpercent=90 or similar; if you want to allow up to 3 substitutions difference, you can use the flag s=3; etc.
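Putting the two flags together might look like the sketch below. The wrapper name run_dedupe and the file names are assumptions made for illustration. Only dedupe.sh, in=, out=, minlengthpercent= and s= come from the answer above, and the values 90 and 3 are examples to tune for your data:

```shell
# Hypothetical wrapper around BBMap's dedupe.sh (must be on PATH).
# $1 = input FASTA, $2 = output FASTA
run_dedupe() {
    # minlengthpercent=90: only absorb a contained sequence if it is
    #                      at least 90% of the length of its container
    # s=3:                 allow up to 3 substitutions between sequences
    dedupe.sh in="$1" out="$2" minlengthpercent=90 s=3
}
```

It could be called as run_dedupe reference.fa reference_dedupe.fa, where both file names are placeholders.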

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Brian Bushnell 20k