Question

How to align .fastq files to an Addgene vector sequence?

0

6.2 years ago

Kristin Muench ▴ 620

Hello,

I would like to find out if any reads in my .fastq file were transcribed from a vector sequence that is not in the human reference genome.

The sequence is available here: https://www.addgene.org/27080/sequences/#depositor-full

I thought I might try aligning the .fastq files to the sequence using TopHat, as if the sequence were a human reference genome, and seeing if any alignments pop up.

However, I'm not sure how to go about doing this.

Should I make the sequence above into a .gtf file somehow?

How do I make the corresponding annotation (.gff) file?

Is there an easier way to go about doing this? E.g., isolating every sequence in the .fastq file and using grep to search for it in the vector sequence?

Thank you!

EDIT: I am aligning Illumina RNA-Seq .fastq files. Also, I'd appreciate any resources folks have regarding how to modify or add chromosomes to a reference file!

RNA-Seq alignment • 2.5k views

ADD COMMENT • link 6.2 years ago by Kristin Muench ▴ 620

1

My standard answer when someone mentions Tophat:
You should know that the old 'Tuxedo' pipeline of TopHat(2) and Cufflinks is no longer the advisable choice for RNA-seq analysis. The software is deprecated or in low-maintenance mode and should be replaced by HISAT2, StringTie and Ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. (If you can't get access to that publication, let me know and I'll -cough- help you.) There are also other alternatives, including alignment with STAR or BBMap, or pseudo-alignment using Salmon.

Please stop using Tophat https://t.co/Es4ohxOEyx Cole and I developed the method in *2008*. It was greatly improved in TopHat2 then HISAT & HISAT2. There is no reason to use it anymore. I have been saying this for years yet it has more citations this year than last #methodsmatter
— Lior Pachter (@lpachter) December 2, 2017

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Eek, thank you for the heads-up! I'll take a look at those other methods.

ADD REPLY • link 6.2 years ago by Kristin Muench ▴ 620

score 2 · Answer 1 · 2018-02-01

2

6.2 years ago

WouterDeCoster 47k

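I believe the most correct approach would be to add your vector to the human genome as an additional chromosome and index the combined reference for alignment using e.g. STAR or HISAT2 (assuming you have Illumina RNA-seq data, which you did not specify).

Aligning only to the vector (without the human genome) might produce false positives, because human reads with partial similarity have nowhere better to align. Aligning only to the human genome cannot detect vector-derived reads at all, and some of them may be misplaced onto similar human sequence.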
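
As a rough illustration of the "additional chromosome" idea, here is a minimal Python sketch that appends the vector to a genome FASTA as one extra record. The function name `append_vector`, the file names, and the record name `vector` are all hypothetical, and the sketch assumes the vector FASTA holds a single sequence and that the chosen record name does not clash with an existing chromosome name.

```python
def append_vector(genome_fasta, vector_fasta, combined_fasta, vector_name="vector"):
    """Write genome + vector into combined_fasta; return the vector length in bases.

    Assumes vector_fasta contains exactly one sequence (hypothetical helper).
    """
    with open(combined_fasta, "w") as out:
        # Stream the genome line by line so it never has to fit in memory.
        last = ""
        with open(genome_fasta) as genome:
            for line in genome:
                out.write(line)
                last = line
        # Make sure the new header starts on its own line.
        if last and not last.endswith("\n"):
            out.write("\n")

        # Collect the vector sequence, ignoring its original header line.
        chunks = []
        with open(vector_fasta) as vector:
            for line in vector:
                line = line.strip()
                if line and not line.startswith(">"):
                    chunks.append(line)
        sequence = "".join(chunks).upper()

        # Write the vector as an extra "chromosome", wrapped at 60 bases per line.
        out.write(f">{vector_name}\n")
        for start in range(0, len(sequence), 60):
            out.write(sequence[start:start + 60] + "\n")
    return len(sequence)


if __name__ == "__main__":
    # Hypothetical file names; replace with your own paths.
    length = append_vector("human_genome.fa", "addgene_vector.fa", "human_plus_vector.fa")
    print(f"Appended vector of {length} bp")
```

The combined FASTA would then be indexed with the aligner's own index builder (for example `hisat2-build`, or STAR in `--runMode genomeGenerate`). If you align with an annotation file, the vector would also need a matching entry there for reads on it to be counted.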

ADD COMMENT • link 6.2 years ago by WouterDeCoster 47k

0

Great point! And yes, I am using Illumina RNA-Seq data.

ADD REPLY • link 6.2 years ago by Kristin Muench ▴ 620

score 1 · Answer 2 · 2018-02-01

1

6.2 years ago

GenoMax 141k

You could use bbsplit.sh from the BBMap suite to bin the reads and quickly find out which ones come from the human genome and which from the vector. You can decide how to handle reads that multi-map (both within and across the genomes).
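
A small sketch of how such a bbsplit.sh call might be assembled from Python follows. The helper name `bbsplit_command` and the file names are hypothetical. The `in=`, `ref=`, `basename=` and `ambiguous2=` parameters are the ones I understand bbsplit.sh to accept, but check the tool's built-in help for your installed version before relying on them.

```python
import shlex


def bbsplit_command(reads, references, out_pattern="out_%.fq", ambiguous2="best"):
    """Build an argument list for bbsplit.sh (hypothetical helper).

    out_pattern must contain '%', which bbsplit replaces with each reference name.
    ambiguous2 controls reads that map to more than one reference
    (assumed values: best, toss, all, split).
    """
    if "%" not in out_pattern:
        raise ValueError("out_pattern must contain '%' as the reference placeholder")
    return [
        "bbsplit.sh",
        f"in={reads}",
        "ref=" + ",".join(references),
        f"basename={out_pattern}",
        f"ambiguous2={ambiguous2}",
    ]


if __name__ == "__main__":
    # Hypothetical file names; replace with your own paths.
    cmd = bbsplit_command("reads.fq", ["human_genome.fa", "addgene_vector.fa"])
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Setting `ambiguous2="toss"` instead would discard reads that match both references, which is the conservative choice if you only want reads that are unambiguously from the vector.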