Question

Best way to use k-mers as a predictor in my neural network

5 months ago

Jimmy ▴ 30

I have a genome and, for each window of it, some predictors (e.g. chromosome, GC content, etc.) for a response variable. I'm working in TensorFlow.

I also want to use the k-mers or maybe k-mer frequency as a predictor. My issue is dimensionality. For example, if I want to one-hot encode all 5-mers, then that is 5! = 120 columns [EDIT: actually 5^5 = 3125], which is still feasible but only captures short-range data. If I want to encode all 10-mers, that is 10^10 columns. This does not seem like the best way to go about it.

I could also think about using a 1D-CNN (not something I've done before) but in this case my understanding is that I would be effectively just feeding in the sequence. I don't see how I could both feed in sequence data as well as features like chromosome, GC content and more.

What is the best way to include the k-mers of a genomic window as a predictor alongside the other features I have mentioned?

convolutional-neural-network kmer • 678 views

updated 5 months ago by Brian Bushnell 20k • written 5 months ago by Jimmy ▴ 30

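If using a CNN, you'd one-hot encode your nucleotides. There are a couple of ways to include the "extra" information: 1) add it as an extra "nucleotide", i.e. an additional channel next to the one-hot A/C/G/T channels, or 2) use "Concatenate" (in Keras) to join it with the sequence branch of the network.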
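A minimal sketch of option 2, assuming Keras as shipped with TensorFlow and a regression-style response. The window length, the number of extra features, the layer sizes, and the input names are all placeholders chosen for illustration, not values from this thread.

```python
import tensorflow as tf
from tensorflow.keras import layers

WINDOW_LEN = 1000  # hypothetical window length in bp
N_EXTRA = 8        # hypothetical number of per-window features (GC content, encoded chromosome, ...)

# Branch 1: one-hot encoded sequence, one channel each for A/C/G/T.
seq_in = tf.keras.Input(shape=(WINDOW_LEN, 4), name="sequence")
x = layers.Conv1D(filters=32, kernel_size=8, activation="relu")(seq_in)
x = layers.GlobalMaxPooling1D()(x)  # collapse positions into one vector per window

# Branch 2: the per-window features, fed in as a plain vector.
extra_in = tf.keras.Input(shape=(N_EXTRA,), name="extra_features")

# Join the two branches, then predict the response.
merged = layers.Concatenate()([x, extra_in])
h = layers.Dense(32, activation="relu")(merged)
out = layers.Dense(1, name="response")(h)

model = tf.keras.Model(inputs=[seq_in, extra_in], outputs=out)
model.compile(optimizer="adam", loss="mse")  # assumes a continuous response
```

Training would then pass both arrays together, e.g. `model.fit([seq_array, extra_array], y)`, where `seq_array` has shape `(n_windows, WINDOW_LEN, 4)` and `extra_array` has shape `(n_windows, N_EXTRA)`. A k-mer frequency vector could be fed through the second input in the same way.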

If you want to use k-mer content to predict whether your organism is bacterial or human, then just use k-mer frequencies rather than a CNN.

If you want to predict cis-regulatory profiles (e.g. scan an input sequence and say: ah hah, here's where chromatin will be accessible), then you use a CNN.

It all depends on your prediction task :)

Reply • 5 months ago by dsull ★ 5.9k

Your math is a bit off. Ignoring reverse-complements, there are 4^5 = 1024 5-mers and 4^10 = 1048576 10-mers. Collapsing reverse-complements gives you 512 5-mers and 524800 10-mers. However, you don't one-hot encode these; that's for raw sequence. Instead, you use the abundance (fraction) of each k-mer as the input for its column.

Overall, the choice of k-mer length depends on the length of the features you are interested in and how much data you have for training.
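As a rough sketch of the abundance encoding described above: the function names below are made up for illustration, and the choice to skip k-mers containing anything other than A/C/G/T is an assumption, not something specified in this thread.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

def canonical_kmers(k):
    """Sorted list of all canonical k-mers; this fixes the column order."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})

def kmer_fractions(seq, k, columns=None):
    """One value per canonical k-mer: its fraction among the valid k-mers in seq."""
    if columns is None:
        columns = canonical_kmers(k)
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # assumption: skip k-mers with N or other symbols
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in columns]

# One row of the feature matrix for a (toy) window.
columns = canonical_kmers(5)
row = kmer_fractions("ACGTTGCAACGTAGCTAGCTAGGATCC", 5, columns)
```

Here `len(columns)` is 512, matching the count above. For even k, a k-mer can be its own reverse complement, which is why the 10-mer count is (4^10 + 4^5) / 2 = 524800 rather than exactly half of 4^10.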

Reply • 5 months ago by Brian Bushnell 20k

score 1 · Answer 1 · 2023-10-31

5 months ago

Mensur Dlakic ★ 27k

I think k-mer frequency may be the best way. Don't underestimate how much information is included even at the tetramer level, let alone for pentamers. Not sure what you are doing and why "long range" information is needed, but there will be a solid signal both from 4- and 5-mers.

score 1 · Answer 2 · 2023-11-01

I'd look into some of the transformer-style models like DNABERT-2 or the Nucleotide Transformer. We've done this for lncRNA labeling in genome assemblies: https://www.biorxiv.org/content/10.1101/2022.02.09.479647v1.full

We used DNABERT1 back then. All of these models are on Hugging Face, and there's a nice benchmark here: https://huggingface.co/spaces/InstaDeepAI/nucleotide_transformer_benchmark