Biostar Test Site

This is site is used for testing only. Visit: https://www.biostars.org to ask a question.

Best strategy starting from allelic chromosomes instead of reference plus VCF
0
0
Entering edit mode
11 months ago
alfonsosaera ▴ 20

Hi all,

I am trying to do genotyping (DNAseq samples) and allele specific expression (RNAseq samples) of a recently assembled plant (alfalfa (Medicago sativa L.)), nature paper here , this is a tetraploid species so everything gets a little more complicated.

Instead of having a reference genome and a VCF, the authors generated an allele-aware chromosome-level genome assembly for the cultivated alfalfa consisting of 32 allelic chromosomes (8 chr x 4 copies), chrom lengths:

>chr1.1
82459472
>chr1.2
86910131
>chr1.3
79881340
>chr1.4
88815615
>chr2.1
76462061
>chr2.2
74215936
>chr2.3
76375162
>chr2.4
76750018
>chr3.1
93149498
>chr3.2
93375939
>chr3.3
96157931
>chr3.4
100414524
>chr4.1
90245664
>chr4.2
93947428
>chr4.3
90228617
>chr4.4
90896203
>chr5.1
81211777
>chr5.2
84165483
>chr5.3
80712490
>chr5.4
78626892
>chr6.1
80303593
>chr6.2
89579199
>chr6.3
84649260
>chr6.4
64534737
>chr7.1
88407277
>chr7.2
93528358
>chr7.3
91580142
>chr7.4
94657719
>chr8.1
87242343
>chr8.2
84274390
>chr8.3
82440740
>chr8.4
81801543


I recently found vg tools and I thing it solves most of the problems I have but I am not sure how to make the analysis. Here is what I have now:

1.- Generate a graph for each chromosome set and then combine them.

Based on the github page I though about something like:

./vg msga -f chr2.fasta -t 50 -k 16 -D -A | ./vg mod -U 10 - -t 50 | ./vg mod -c - -t 50 > chr2.vg


But reading the wiki, a modification of this seems better:

vg msga -f MHC.fa \ # use MHC.fa as our input
-b ref \        # the >ref sequence is our basis
-B 128 \        # align 128-mer bands of the sequences
-K 11 \         # use 16-mers as the basis for our indexing
-X 2 \          # double twice in GCSA2, generating a 44-mer deBruijn graph
-E 3 \          # remove edges from the graph that would allow us to cross 3 bifurcations in 11bp
-G -S 0.95 \    # greedily accept alignments when they match at 95% of our maximum alignment score
-H 5 \          # ignore MEMs which have more than 5 hits
-D \            # basic debugging output, describing progression through the process
>MHC.vg         # write the output to MHC.vg

vg mod -n MHC.vg | vg mod -X 32 - | vg ids -c - >MHC.norm.vg


How do I set parameters like -B, -K, -E ...?

Now I would have to fuse/concatenate the 8 chromosome_graph.vg to get a genome_graph.vg using vg concat

2.- Augment the graph with info from DNA-seq samples

However reading the github page I am not sure how to do it.

3.- Map RNA-seq samples to augmented graph

I am thinking in a modification of this:

# map reads against the graph to get a GAM
vg map -T x.sim.txt -x x.xg -g x.gcsa > aln.gam

# surject the alignments back into the reference space of sequence "x", yielding a BAM file
vg surject -x x.xg -b aln.gam > aln.bam

# or alternatively, surject them to BAM in the call to map
vg map -T x.sim.txt -x x.xg -g x.gcsa --surject-to bam > aln.bam


4.- Perform allele specific expression

My idea was to get a VCF from the augmented graph (aug.graph.vcf), but reading the Calling variants using read support section I don't really know how to do it. Next I would extract a linear reference (aug.graph.linear.fasta) from the graph to use GATK ASEReadCounter using the bam from step 3 (aln.bam) as below:

gatk ASEReadCounter -R aug.graph.linear.fasta -I aln.bam -V aug.graph.vcf -O output.table


Thanks for getting to here, I have 3 general questions:

1.- Do you find this approach suitable? would it be better to focus only in coding regions?

2.- I have a modest server, not a cluster or supercomputer, with ~ 200GB RAM and ~40 processors, will I be able to run vg tools for my samples in this machine?

3.- I found toil-vg which seems to make life easier reducing command calls and parallelize vg tools. Is this something I could use in my machine?

4.- At the end of the analysis I need to check gene expression so I would need to add gene annotations to the genome graph from a GFF of the assembled genome, is there a command for this? alternatively, is there a way to do something like a liftover between the published genome and the resulting graph?

Thank you very much!!!!

Alfonso

vg • 318 views