Question

Does the mapper can map the illumina reads exactly to the super-scaffolds

0

Entering edit mode

7.9 years ago

liuhui ▴ 20

Hello,

I am working with very fragmented and very large genome, which is over 1,000,000 scaffolds!

I want to assemble the scaffolds into artificial super-scaffolds (where the real scaffolds are separated by stretches of Ns that are a few times as long as the read length,such as 100bp).

My question is whether the mappers, such STAR, gsanp and bwa, can map the illumina reads exactly to the super-scaffolds.

super-scaffolds bwa star mapper gsnap • 2.2k views

ADD COMMENT • link 7.9 years ago by liuhui ▴ 20

0

Entering edit mode

Why do you want to artificially combine the scaffolds? You can do that, of course, but then the output of the mapping would only make sense in context of artificially fused assembly, rather than the real assembly.

ADD REPLY • link 7.9 years ago by Brian Bushnell 20k

0

Entering edit mode

Because of the hundreds of thousands of scaffolds, the speed of mapping would be very slow and will cause trouble for the downstream analysis, such as call snp using GATK.

ADD REPLY • link 7.9 years ago by liuhui ▴ 20

0

Entering edit mode

Have you already tried it, and found it to cause problems with GATK? The mapping speed should not be noticeably affected by decreasing the number of scaffolds... I suppose it depends on the specific implementation of the algorithm, but generally, it's a function of genome size rather than number of scaffolds. BBMap, for example, won't really care if you give it 1,000 fused "super-scaffolds" or 100,000,000 real scaffolds. In some rare cases you may get reads being called proper pairs when actually they span different scaffolds (if you feed a mapper fused "super-scaffolds"), but probably not often enough to cause problems. But if downstream tools break because there are too many scaffolds, then yes, you could combine them, and it should not affect the results much for SNP calling. For other mutation types such as long indels and rearrangements... it would cause problems. If possible, try to find tools that don't break on assemblies with millions of scaffolds, rather than using a convoluted process to force your data through tools that were really designed to work on finished reference genomes; that can be error-prone.

That said, perhaps it would be more useful to first spend time improving the assembly prior to things like SNP-calling? Is this a gigantic hexaploid plant or something like that?

ADD REPLY • link 7.9 years ago by Brian Bushnell 20k

0

Entering edit mode

I used the genome of loblolly pine (masked and trimmed version,http://dendrome.ucdavis.edu/ftp/Genome_Data/genome/pinerefseq/Pita/v1.01/) to call snp with RNA seq data.

Whether I connected the scaffolds or not, I got errors when using picard-tools-2.4.1 to handle bam file from STAR mapper, such as add read groups and mark duplicates. I have tried to use FixMateInformation to fix the problem, but FixMateInformation threw out errors. And, I have used gsnap, but another error occured to me ...

And now, although I get errors when I use picard to remove duplicates from bam file with bwa (mem) as mapper, I have fixed the problem since I used fastuniq to remove duplicates from fastq file as alternative.

As far as know, bwa mem can align different portions of a read to different locations on the genome (span a splice junction), and I would like to use the bwa mem to map illumina reads (150bp, pair end) to the genome I mentioned.

Since reference genome is fragment and very large (with 2,741 bp and 318,524 bp of average and maximum intron size), could you give me any advice (such as how to set the parameters: -k -w -r -c) to improve the result of mapping or how to estimate the effect of mapping.

(wish to express clearly)

ADD REPLY • link 7.8 years ago by liuhui ▴ 20

0

Entering edit mode

I'd recommend you use BBMap, which is a splice-aware RNA-seq aligner and can produce single alignments spanning long introns, and map to the original (not fused) assembly without problems. You can use the flags "maxindel=320k intronlen=1000". It is generally NOT a good idea to remove duplicates unless your libraries were amplified, which is of course not recommended for RNA-seq, so you should not be doing that unless you know that the libraries were amplified and you do not care about expression levels.

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k

0

Entering edit mode

Hi, Brian, Thank you very much for your advice. In fact, some of my RNA-seq data contained very high proportion of PCR duplicates, so I have to use picard or fastuniq to remove them.

As I mention above, I want to call snp from a bulk of RNA-seq data from illumina. It is a good choice for me because the BBMap is very faster and high sensitivity and specificity.

But I am wondering whether the Picard can handle smoothly the bam file from BBMap (maybe I have to add read groups using picard for downstream analysis)? Could you give me any advice from your experience?

ADD REPLY • link 7.8 years ago by liuhui ▴ 20

0

Entering edit mode

Picard can handle normal files produced by BBMap, and BBMap does support readgroups - you can set them with the "rgid" flag, or add them later. But I would like to note one small thing about the bam format - lh3 can correct me if I'm wrong here, but my understanding is that it has a limit of 2GB total size of header sequences. So if you have enough header sequences, the bam file might be corrupt, no matter what program you use.

In that case, you could perhaps output a sam file, then split the sam file into smaller pieces, each with only 1/10th of the scaffolds, and with only the reads mapping to those scaffolds. Then you could call SNPs on those pieces independently and combine the results... it might be less likely to break a SNP-caller.

Do you happen to know the exact number of scaffolds and how long their headers are?

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k

0

Entering edit mode

The number of scaffolds is about 1.6 million and the header size is about 60 MB in size.

ADD REPLY • link 7.8 years ago by liuhui ▴ 20

0

Entering edit mode

OK... there should not be any problems, then. If you encounter any, I recommend notifying the authors of the tools.

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k

0

Entering edit mode

That's great, thank you very much.

ADD REPLY • link 7.8 years ago by liuhui ▴ 20

0

Entering edit mode

I know supercontigs and scaffolds. What are "superscaffolds"? What do you mean by "exactly"? Exact match?

ADD REPLY • link 7.9 years ago by lh3 33k

0

Entering edit mode

The "superscaffolds" I mentioned is a artificial scaffold, where the real scaffolds are connected by a number of "Ns" (I think that the concept of "superscaffolds" are the same as supercontigs).

I am wondered that whether the results of alignments will make difference, between the reads mapped to superscaffolds (or supercontigs) and the reads mapped to real scaffolds.

ADD REPLY • link 7.9 years ago by liuhui ▴ 20