Question

RNAseq from circular plasmids

0

Entering edit mode

5.7 years ago

new_user • 0

Does anyone have any suggestions for split read mapping to circular reference genomes, or whether there is a way to modify the Hisat2-Stringtie-Ballgown pipeline so it work a little better in this scenario?

I’ve been analysing some RNA-seq data in which circular plasmids were transfected into human cell lines, with a view to looking at alternative splicing (and isoform discovery). After analysing the data I’ve realised that I am getting both sense and anti-sense transcripts which span position 1 (antisense transcription running from the 5’ end of the linear plasmid sequence to the 3' end) or end of the plasmid sequence (due to read through of a polyA site at the 3’ end of the plasmid sequence into the 5' end of the linear sequence). I have been using Hisat2-Stringtie-Ballgown but these ‘read through’ transcripts are confusing the mapping (read pairs marked as discordant when they aren’t really) and the transcript assembly, which is resulting in exons being mapped which start at position 1 or end at the last nucleotide of the reference sequence. These are then incorrectly assembled into the transcripts.

I thought about stitching a number of the linear sequences together but this would obviously result in non-uniquely mapped reads which would impact the transcript assembly. I guess I could also modify bit-wise flags for a subset of the reads to make them concordant when spanning the the first and last bp but am not sure what effect if any this would have on Stringtie and the assembly of the transcripts.

Any suggestions from anyone who has done anything similar would be welcome!

RNA-Seq alignment assembly • 1.8k views

ADD COMMENT • link 5.7 years ago by new_user • 0

0

Entering edit mode

You could extract the sequences of the gene regions (e.g. bedops gff2bed + bedtools getfasta) and manually add in the regions that overlap the end/start cut. Then index that and run the pipeline on it. Usually, one would just extract the CDS and align with a non-spliced aligner, but since you explicitly mentioned alternative splicing, I think that might work.

ADD REPLY • link 5.7 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Thanks for the response. This is what I meant about stitching extra sequence on the ends. I can try that but am pretty sure because reads will then multimap this will mess up the transcript assembly which uses coverage over splice junctions as input. What I really need is a split read mapper and transcript assembly tool which is aware of circular reference sequences.

ADD REPLY • link 5.7 years ago by new_user • 0

1

Entering edit mode

I don't mean stitch extra sequence to the ends. I mean you could do something similar to running RNAseq analysis against a transcriptomic reference. However, instead of giving transcriptomic sequences, you'd provide the full sequences for all gene regions (extracted from the annotation with manually adding in one sequence that you stitch together from the gene boundaries of the gene that overlaps the end/start gap.) Now you can run a spliced aligner against the gene regions and as long as the alternative transcripts stay within the region boundaries, it might just work. In this case I would not expect anything to multi-map unless there are highly similar genes.

ADD REPLY • link 5.7 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Sorry I'm really new to RNAseq so not entirely sure what you mean. The annotation for the circular reference plasmid sequences are provided in gtf file format. I will probably try to rotate the linear sequence so the 'genome' sits more centrally and worry about the plasmid backbone which contains the bacterial ori and antibiotic resistance marker KanR (which I'm not really interested in). Is this what you meant? This is a viral vector plasmid from which longer genomic RNAs are generated (with lots of possible splice variants) as well as an internal transcript driven from a strong promoter.

On a related note I have reads mapping against the antibiotic resistance gene (KanR, driven by a bacterial promoter which shouldn't be active in my mammalian cell line) but I believe these are an artefact of the fact that my cell line also expresses aph (neomycin resistance) which includes homologous sequence to KanR. I have looked at Hisat2 documentation and done a search and soft-masking (lowercase) doesn't seem to be recognised by Hisat2. Do you think I should just hard mask (X or N?) this sequence to prevent reads mapping to this region? (or should I post this as a separate question in its own right!).

ADD REPLY • link 5.7 years ago by new_user • 0

1

Entering edit mode

Ok, I'll try to explain.

You can do RNAseq in two different ways. Either you align against a) a genome or against b) a transcriptome. In the former case you will most likely (except in most bacteria) need to use a splice-aware aligner such as HISAT. In the latter case you can just use a normal off-the-shelf aligner (bwa/bowtie2).

Now, if a circular genome is involved, there might be the situation as you describe: a gene spans the end/start cut (henceforth called the "spanning" gene). This is obviously a problem for a), but not for b), as you'd provide the transcriptomic sequences yourself and the spanning transcript would be present with its proper sequence. With gtf/gff and the genomic sequence you can extract these sequences using e.g. bedops gff2bed and then bedtools bed2fasta. However, you might have to add the sequence of the spanning gene to the resulting fasta yourself, as the composite coordinates from start and end of the sequence might confuse the software.

For your case - circular genome and potential splice activity - I was proposing an alternative to version b). If you expect that your splicing activity will happen within the known gene regions, you can extract the sequences for the genes (just as you would do for the transcripts) and add in the spanning gene. Then, using a splice aware aligner on those gene sequences, you might be able to find your splicing activity. If alternative transcripts are formed outside of the annotated gene boundaries that will not work obviously. I am not sure which is the case for a viral vector plasmid.

Rotating the genome would probably also work, but you need to be aware that you would have to rotate the annotation as well. I think there are tools for that, though, but I cannot recall any specific one.

Concerning the masking, I think this should be a separate question. Hard-masking (N) would probably work.

ADD REPLY • link 5.7 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Thanks for the detailed reply. My situation is a) and I think because alternative transcripts are formed outside of the annotated gene boundaries this maybe causing the problem. Also the annotation itself is poor (I could work on the .gtf file but it would all have to be done manually).

Looking at the the reads (sense and antisense XS tags) on IGV I am seeing lots of antisense reads from the 5' end of my sequence, which span the start/end of the sequence). These appear to mostly terminate around the polyA (which I assume is acting bidirectionally on the antisense strand). There are some (but fewer) sense reads 3' of the polyA which are probably due to read through. Some of these transcripts will read around into the 5' end of my linear sequence. Its all these extra reads as well those which are mapping erroneously to the KanR sequence which I believe are confusing Stringtie.

When I've improved my annotation, rotated the genome and repeated my mapping and assembly I'll post an update.

ADD REPLY • link 5.7 years ago by new_user • 0