Question

How can I make GATK FastaAlternateReferenceMaker faster?

0

5.7 years ago

joomi ▴ 10

Hello,

I'm creating an alternate reference genome from a VCF file and a reference genome, both of which are input into GATK's faster alternate reference maker. The whole process takes longer than I wanted, and I realized this is because it is making changes in repeat regions and other places that are not transcripts. I then took the intervals from the GTF file and used the -L option to focus only on transcript regions. However, this outputs a file containing only the transcripts, not the standard genome.fa layout where each chromosome is a single record with a header such as ">10". Now there are multiple ">10" records, and I think each one is an individual transcript.

This might be a trivial question; I'm new to this and wanted to ask for advice. I know I can also post this on the GATK website, but I wanted to see whether there are alternative tools or platforms that might be faster. Is there a way to make GATK's faster alternate reference maker run faster while still making the changes? Or is there another format I can use? Also, is there a way to tell it to focus only on transcripts but still output a normal genome FASTA file?

Thank you very much in advance!

RNA-Seq • 2.5k views

ADD COMMENT • link updated 5.7 years ago by FGV ▴ 170 • written 5.7 years ago by joomi ▴ 10

1

GATK faster alternate reference maker

Surely you mean GATK FastaAlternateReferenceMaker?

What is the exact command you're using right now? GATK documents its parallelization options on the common CommandLineTools page.

ADD REPLY • link 5.7 years ago by Ram 43k

0

Thank you for responding!!

I'm using the following command:

srun java -jar /share/apps/GATK-3.6/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -V $Files/PT_sanger2yTCA454RNvsB73_with_read_groups_filtered_refina2_snps.ann.gatk.vcf -R $Files/Zea_mays.AGPv3.22.dna.genome.fa -L $Files_made/VCF.intervals -o $Files_made/$names/PT_genome_intervals.fa

I made the file VCF.intervals using a Python script I wrote that takes the coordinates from the GTF file and sorts them in ascending order. I also wrote the file in the format described on their website, linked here: GATK Interval File Format

I used format B, "GATK-style .list or .intervals", and here is a snippet of what my VCF.intervals file looks like:

10:300326-300577
10:301453-301600
10:300326-300886
10:301112-301160

By the way, in my actual file there are no spaces between the lines; I don't know why they are shown here next to each other, separated by a single space. I know I used chromosome 10 multiple times, but I don't know of a way to name it once and list all of its interval ranges on one line without making GATK go through the whole genome again, which would take as long as before. The problem is that my output file is vastly different from the one I get without intervals: there are again multiple chromosome 10 records, and I think those are just the transcript ranges rather than the full genome it should output.

Thank you and please help!!
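The GTF-to-intervals step described above can be sketched roughly as follows. This is not the poster's script, only a minimal illustration: the function names are hypothetical, the choice of `exon` as the feature type is an assumption, and merging overlapping ranges (such as 10:300326-300577 and 10:300326-300886 in the snippet) is an added step, not something the original post says it does. Contigs are ordered by plain string sort here, which may differ from the order in the reference.

```python
from collections import defaultdict


def gtf_to_intervals(gtf_lines, feature="exon"):
    """Collect 1-based inclusive (start, end) pairs per contig for one feature type."""
    per_contig = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GTF comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[2] != feature:
            continue
        per_contig[fields[0]].append((int(fields[3]), int(fields[4])))
    return per_contig


def merge_intervals(intervals):
    """Sort intervals and merge any that overlap or directly abut."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def format_intervals(per_contig):
    """Return GATK-style 'contig:start-end' lines, one interval per line."""
    lines = []
    for contig in sorted(per_contig):  # assumption: plain string order of contigs
        for start, end in merge_intervals(per_contig[contig]):
            lines.append(f"{contig}:{start}-{end}")
    return lines
```

Applied to the four ranges in the snippet, this would yield three non-overlapping lines for chromosome 10. Writing `"\n".join(format_intervals(...))` to a file gives one interval per line.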

ADD REPLY • link updated 5.7 years ago by GenoMax 141k • written 5.7 years ago by joomi ▴ 10

0

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

Thank you!

ADD REPLY • link 5.7 years ago by GenoMax 141k

score 0 · Answer 1 · 2018-07-31

I'd suggest splitting your analysis by chromosome/scaffold/contig (or by groups of them) and running each one in parallel. At the end you can simply concatenate all the output files.
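A minimal sketch of that split-and-concatenate approach, assuming the GATK 3 flags from the command posted earlier in the thread (-T, -R, -V, -L, -o) and passing a whole contig name to -L. The helper names, the output file naming, and the worker count are hypothetical. Contig names are read from the reference's .fai index, which GATK already requires next to the FASTA.

```python
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor


def contigs_from_fai(fai_lines):
    """Return contig names (first column) from the lines of a .fai index."""
    return [line.split("\t")[0] for line in fai_lines if line.strip()]


def build_command(contig, jar, reference, vcf, out_dir):
    """Build one GATK 3 command restricted to a single contig via -L."""
    return [
        "java", "-jar", jar,
        "-T", "FastaAlternateReferenceMaker",
        "-R", reference,
        "-V", vcf,
        "-L", contig,
        "-o", f"{out_dir}/{contig}.alt.fa",  # hypothetical naming scheme
    ]


def run_all(commands, workers=4):
    """Run the commands concurrently; raises if any GATK run exits non-zero."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))


def concatenate(paths, out_path):
    """Concatenate the per-contig FASTA files in the given order."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as handle:
                shutil.copyfileobj(handle, out)
```

On a cluster, each command could instead be submitted as its own job (for example through srun, as in the original command). It is worth inspecting the FASTA headers of the per-contig outputs before concatenating, since I am not certain they keep the original contig names.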

However, be careful with GATK FastaAlternateReferenceMaker: it only replaces reference bases with the alternate alleles found in your sample. Regions with missing data will always be identical to the reference.
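To make that caveat concrete, here is a toy sketch of SNP substitution on a single sequence. It is not GATK's implementation and ignores indels, genotypes, and filters; `apply_snps` and its input layout are hypothetical. Any position without a variant record, including positions where the sample simply has no data, comes out unchanged.

```python
def apply_snps(reference, snps):
    """Substitute alternate alleles into a sequence.

    reference: the reference sequence as a string.
    snps: dict mapping 1-based position -> (ref_base, alt_base).
    Positions absent from snps keep the reference base.
    """
    bases = list(reference)
    for position, (ref_base, alt_base) in snps.items():
        index = position - 1  # VCF positions are 1-based
        if bases[index] != ref_base:
            raise ValueError(f"reference mismatch at position {position}")
        bases[index] = alt_base
    return "".join(bases)


# Only positions 2 and 7 have variant calls; everything else stays as reference.
alternate = apply_snps("ACGTACGT", {2: ("C", "T"), 7: ("G", "A")})
```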

cheers,