Hello,
I think my RBP of interest binds to polyA tails. I have fastq.gz files from a CLIP-seq experiment and I want to see whether any of my reads have high polyA content. I tried using the STAR aligner (version 2.5.3a) to index a very short sequence of A's, and that worked:
STAR --runMode genomeGenerate --runThreadN 16 --genomeDir PolyA/star_index --genomeFastaFiles PolyA/PolyA.fasta --genomeSAindexNbases 2 --limitGenomeGenerateRAM 33000000000
I then tried aligning my fastq.gz to the indexed polyA genome. The job ran for 24 hours and then aborted, whereas aligning to the human genome finished in less than 20 minutes. The command I used is below:
STAR --runMode alignReads \
--runThreadN 16 \
--genomeDir ../genomes/PolyA/star_index \
--genomeLoad LoadAndRemove \
--readFilesIn pathtomyfile.fastq.gz \
--readFilesCommand zcat \
--outFilterMultimapNmax 20 \
--outFileNamePrefix myfileout.bam \
--outSAMattributes All \
--outSAMtype BAM Unsorted \
--outFilterMismatchNmax 10
Does anyone have suggestions as to why this failed, or other recommendations for measuring how much polyA content I have in my samples?
Thank you!
I do not really see the point in doing that. Polyadenylation is a post-transcriptional modification, meaning that dedicated enzymes add the polyA tail to the pre-mRNA after transcription. The polyA tail is not part of the gene in the genome, so alignment won't help you. I would rather use a dedicated trimming tool such as bbduk, trimmomatic or cutadapt to trim polyA tails (please use the search function), and then check what percentage of the reads contained that pattern.
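For a quick first estimate before setting up one of those trimmers, the "what percentage of reads contain the pattern" check can be sketched in plain shell. This is only a rough sketch under several assumptions: the function name polya_fraction is made up for illustration, the input is a gzip-compressed FASTQ with exactly four lines per record, adapters have already been removed, the polyA stretch sits at the 3' end of the read, and the default minimum run of 15 A's is an arbitrary cutoff you should adjust.

```shell
# Hypothetical helper (not part of STAR or any trimming tool).
# Prints: <reads ending in >= min_len A's> <total reads> <percent>
# Assumes gzip-compressed FASTQ with 4 lines per record and adapters already trimmed.
polya_fraction() {
    fastq_gz=$1
    min_len=${2:-15}   # arbitrary default cutoff for the trailing run of A's

    gzip -dc "$fastq_gz" | awk -v n="$min_len" '
        NR % 4 == 2 {                        # sequence line of each record
            total++
            s = $0
            sub(/A+$/, "", s)                # strip the trailing run of A (if any)
            if (length($0) - length(s) >= n) # length of that run
                hits++
        }
        END {
            pct = (total ? 100 * hits / total : 0)
            printf "%d %d %.2f\n", hits, total, pct
        }'
}
```

It would be called as polya_fraction pathtomyfile.fastq.gz 15. It only counts exact runs of A at the very end of the read, so it does not tolerate mismatches or low-quality bases the way a dedicated trimmer can, and the number should be treated as a lower-bound estimate.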