Question

STAR align multiple files

7

Entering edit mode

6.6 years ago

ta_awwad ▴ 340

Hi everybody, I am doing alignment to 36 PE samples using star. to make it little bit easy task I wrote a bash loop to align them all with the same command. here is my loop:

for i in $(ls raw_data); do STAR --genomeDir index.150 \
--readFilesIn raw_data/$i\_1.fq.gz,raw_data/$i\_2.fq.gz \
--runThreadN 20 --outFileNamePrefix aligned/$i. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile GRCm38.90.gtf \
--readFilesCommand zcat ; done

but it seems that something wrong as the alignment took overnight and it was not done yet.

any recommendation

thanks much

RNA-Seq ChIP-Seq alignment STAR • 17k views

ADD COMMENT • link updated 2.4 years ago by Pahi ▴ 20 • written 6.6 years ago by ta_awwad ▴ 340

8

Entering edit mode

For 36 samples, you could speed up by loading the index into memory, and unloading when finished mapping:

STAR --genomeLoad LoadAndExit --genomeDir index.150

for i in $(ls raw_data | sed s/_[12].fq.gz// | sort -u)
do
    STAR [...]
done

STAR --genomeLoad Remove --genomeDir index.150

ADD REPLY • link 6.6 years ago by h.mon 35k

0

Entering edit mode

Thank you all for these price less info..

ADD REPLY • link 6.5 years ago by ta_awwad ▴ 340

0

Entering edit mode

Hi h.mon,

Could you tell me what is the purpose of index.150 here? Can we just type the location of the genome after --genomeDir?

ADD REPLY • link 5.2 years ago by c_u ▴ 520

1

Entering edit mode

Yes. In the example given index.150 is the name of the index that was in the original question. Replace that with yours.

ADD REPLY • link 5.2 years ago by GenoMax 141k

0

Entering edit mode

If you load the genome before the for loop using: STAR --genomeLoad LoadAndExit --genomeDir genomeDIR Do you still need to specify the --genomeDir parameter in the loop? I tried leaving that out, and STAR failed to run. Then I tried specifying the genome directory in the loop (even though the genome is loaded before the FOR loop), and it looks like each iteration of the loop is still loading the genome.

Can someone explain how to properly load the genome for multiple samples so that the loop is not iteratively loading it, please?

ADD REPLY • link 5.0 years ago by rls_08 ▴ 40

0

Entering edit mode

First you load the genome using --genomeDir $GENOMEDIR --genomeLoad LoadAndExit. For your alignment(s) you need --genomeDir $GENOMEDIR --genomeLoad LoadAndKeep

ADD REPLY • link 5.0 years ago by WouterDeCoster 47k

2

Entering edit mode

To summarize:

STAR --genomeLoad LoadAndExit --genomeDir index.150

for i in $(ls raw_data); do echo STAR --genomeDir index.150 --genomeLoad LoadAndKeep\
--readFilesIn raw_data/$i\_1.fq.gz,raw_data/$i\_2.fq.gz \
--runThreadN 20 --outFileNamePrefix aligned/$i. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile GRCm38.90.gtf \
--readFilesCommand zcat ; done

STAR --genomeLoad Remove --genomeDir index.150

Is this correct?

ADD REPLY • link 2.6 years ago by DareDevil ★ 4.3k

2

Entering edit mode

When looping, test if your code is valid by adding an echo statement to see what the command is going to be:

for i in $(ls raw_data); do echo STAR --genomeDir index.150 \
--readFilesIn raw_data/$i\_1.fq.gz,raw_data/$i\_2.fq.gz \
--runThreadN 20 --outFileNamePrefix aligned/$i. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile GRCm38.90.gtf \
--readFilesCommand zcat ; done

My guess is that the files raw_data/$i_1.fq.gz don't exist because you create $i simply based on the content of raw_data

ADD REPLY • link 6.6 years ago by WouterDeCoster 47k

0

Entering edit mode

thanks much WouterDeCoster for your reply. I run your code and got this:

STAR --genomeDir /index.150 --readFilesIn raw_data/KO_day3_1_1.fq.gz_1.fq.gz raw_data/KO_day3_1_1.fq.gz_2.fq.gz --runThreadN 20 --outFileNamePrefix aligned/KO_day3_1_1.fq.gz. --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --sjdbGTFfile GRCm38.90.gtf --readFilesCommand zcat

you are right. the file name became different.

any suggestion to correct this??

thanks much

ADD REPLY • link 6.6 years ago by ta_awwad ▴ 340

1

Entering edit mode

Can you show a few examples of filenames of the fq.gz files?

ADD REPLY • link 6.6 years ago by WouterDeCoster 47k

0

Entering edit mode

KO_day3_1_1.fq.gz           KO_day4_1_2.fq.gz   mESC_KO_3_1.fq.gz  mESC_KO_3_2.fq.gz      mESC_Wt3_1.fq.gz    mESC_Wt3_2.fq.gz        PG_4WT10_07_17_1.fq.gz    PG_4WT10_07_17_2.fq.gz  PG_7Swht16_07_17_1.fq.gz  PG_7Swht16_07_17_2.fq.gz

ADD REPLY • link 6.6 years ago by ta_awwad ▴ 340

3

Entering edit mode

You could try something like:

for i in $(ls raw_data | sed s/_[12].fq.gz// | sort -u); do echo STAR --genomeDir index.150 \
--readFilesIn raw_data/${i}_1.fq.gz,raw_data/${i}_2.fq.gz \
--runThreadN 20 --outFileNamePrefix aligned/$i. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile GRCm38.90.gtf \
--readFilesCommand zcat ; done

I modified the $i to be shorter, and only keep unique hits since all samples will be in there twice.

ADD REPLY • link 6.6 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks much ... it is running now .. but I am not sure how much time it will take .. I will inform you if everything run fine

ADD REPLY • link 6.6 years ago by ta_awwad ▴ 340

0

Entering edit mode

it looks like it is stuck .. no progress since 30 minutes .. is it normal???

ADD REPLY • link 6.6 years ago by ta_awwad ▴ 340

0

Entering edit mode

You can have a look with (h)top to see if it's still working. Also, check if it's producing output files.

ADD REPLY • link 6.6 years ago by WouterDeCoster 47k

0

Entering edit mode

Hi! I know it's an old post, but I'm hoping for some techical help.

I used your code above, with minor changes. But in the the --readFilesIn part no matter what, the files which are used are renamed. My file names are:

Dros_01_S48_L001_R1_001.fastq.gz
Dros_01_S48_L001_R2_001.fastq.gz

etc. Only the S[number] changes and the R1 and R2 in the names.

So I changed the code like:

for i in $(ls 1_raw_data | sed s/_[12].fastq.gz// | sort -u); do echo STAR --genomeDir /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_STAR/ \
--readFilesIn 1_raw_data/${i}_R1_001.fastq.gz 1_raw_data/${i}_R2_001.fastq.gz \
--runThreadN 20 --outFileNamePrefix 3_aligned/$i. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_gtf/Drosophila_melanogaster.BDGP6.32.104.chr.gtf \
--readFilesCommand gunzip -c ; done

My problem, is what mentioned above, that during the readIn part the files what STAR load in named:

Dros_01_S48_L001_R1_001.fastq.gz_R1_001.fastq.gz
Dros_01_S48_L001_R1_001.fastq.gz_R2_001.fastq.gz

This is for the first cycle. What could be the problem? Also I used this code for different files. In that case the file names were 6754726_1.fastq.gz, 6878764_2.fastq.gz etc. (for that I used your original code!)

ADD REPLY • link 2.4 years ago by Pahi ▴ 20

0

Entering edit mode

I think the problem was that STAR doesn't accept compressed files.

ADD REPLY • link 6.5 years ago by ta_awwad ▴ 340

1

Entering edit mode

it accepts but you need to specify : --readFilesCommand zcat

ADD REPLY • link 6.5 years ago by Nicolas Rosewick 10k

0

Entering edit mode

I did .. and it did not work

ADD REPLY • link 6.5 years ago by ta_awwad ▴ 340

0

Entering edit mode

Works just fine for me, use it all the time.

ADD REPLY • link 6.5 years ago by WouterDeCoster 47k

0

Entering edit mode

"it did not work" doesn't help us know what went wrong, what is the error message? STAR does accept gz compressed files.

ADD REPLY • link 6.5 years ago by h.mon 35k

0

Entering edit mode

just stuck no error message no progress

ADD REPLY • link 6.5 years ago by ta_awwad ▴ 340

2

Entering edit mode

Try gunzip instead. It works with that.

ADD REPLY • link 6.0 years ago by Bog ▴ 20