Question

Generating artificial NGS reads from a FASTA file specifying read length, coverage, and start sequence.

0

Entering edit mode

5.1 years ago

Hamish ▴ 40

I am looking to generate artificial single-end Illumina sequencing data from a FASTA containing a number of short sequences representing unique molecular identifiers. However, I've found limited information on this online and need some help. The start of the input FASTA resembles this (in practice there will be many more sequences):

>1 dna:chromosome
ACCTTAGCAGGT
TCGAACCTGTGA
CGAGTCAGTCTA

And the desired output file is a FASTQ file resembling this:

@HWI-ST745_0097:7:1101:1001:1000#0/1
ACCTTAGCAGGT
+HWI-ST745_0097:7:1101:1001:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1002:1000#0/1
TCGAACCTGTGA
+HWI-ST745_0097:7:1101:1002:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1003:1000#0/1
CGAGTCAGTCTA
IIIIIIIIIIII

Importantly for me, each UMI sequence read needs to be 12 bases long, sequenced in order (i.e. starting with the first UMI in the FASTA) and without overlap, i.e. every UMI is read once and each read in the FASTQ is a mirror of the sequence in the FASTA. Basically, I need a version of the FASTA input file with Phred quality scores. I'm relatively new to bioinformatics and NGS and so far I've been looking at 2 programs to achieve this: Artificial fastq generator and ART.

My artfastgen input looks like this:

java -jar ArtificialFastqGenerator.jar -O output.fastq -R input.fasta -S ">1 dna:chromosome" -RL 12 -TLM 12 -GCC False

The (partial) output is as below:

@HWI-ST745_0097:7:1101:1001:1000#0/1
ACCTTAGCAGGT
+HWI-ST745_0097:7:1101:1001:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1002:1000#0/1
CCTTAGCAGGTT
+HWI-ST745_0097:7:1101:1002:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1003:1000#0/1
CTTAGCAGGTTC

This is very close to what I actually need. The only problem is that each read shifts across one base at a time, whereas I need it to shift 12 (or to the start of the next UMI in the input FASTA).

My ART input is this:

art_illumina -ss HS25 -i input.fasta -l 12 -f 1 -o output_dat

Example output varies with each run and is like this:

@1-ACC3
TAGCAGGTTCGA
+
CCCCCGGGGGGG
@1-ACC2
CAGGTTCGAACC
+
CCCBCGGGEGGG
@1-ACC1
CGTCACAGGTTC
+
BBBCCGBGGGGG

The issue here is that, unlike artfastqgen which allows you to specify the start of the read, ART chooses a random point in the FASTA as the starting point. Like artfastqgen, there is unwanted overlap in the read sequences.

Can anyone hear offer any suggestions on how to achieve this? Ideally using artfastqgen, but any other Linux-based program would also be fine. Thank you.

next-gen sequencing • 3.5k views

ADD COMMENT • link updated 5.1 years ago by Gabriel R. ★ 2.9k • written 5.1 years ago by Hamish ▴ 40

1

Entering edit mode

I don't believe that there are any fastq file simulation programs available that will generate UMI's along with the reads at this time.

ADD REPLY • link 5.1 years ago by GenoMax 141k

1

Entering edit mode

Though I generally second genomax opinion, I would like to add a very comprehensive Nature Genetics Review on the simulation topic:

Escalona et al, 2018: A comparison of tools for the simulation of genomic next-generation sequencing data

ADD REPLY • link 5.1 years ago by Carambakaracho ★ 3.2k

0

Entering edit mode

Since posting I've found a way to achieve this using ART, but this looks like a very relevant and helpful review. Thanks for the input.

ADD REPLY • link 5.1 years ago by Hamish ▴ 40

score 2 · Accepted Answer · 2019-03-04

2

Entering edit mode

5.1 years ago

Gabriel R. ★ 2.9k

I would use the approach we used for gargammel (https://grenaud.github.io/gargammel/). I would use the following options:

art_illumina  -amp -na   -c 1

so -amp says that you do an "amplicon sequencing simulation", in your case, your amplicons are your starting sequences. -na means "do not output ALN alignment file" because we do not need it and -c 1 to generate 1 reads / input sequence.

hope this helps

ADD COMMENT • link 5.1 years ago by Gabriel R. ★ 2.9k

1

Entering edit mode

Thanks for the advice and apologies for my delayed reply. I modified my input command per your suggestion to art_illumina -i input.fasta -na -amp -l 12 -c 1 -o outputx_dat. After some additional troubleshooting, I realised there was a problem with the format of the fasta file I was using. It now has multiple headers dividing each sequence to be read, i.e.

>1
ACCTTAGCAGGT
>2
TCGAACCTGTGA
>3
CGAGTCAGTCTA

ART now processes each read once and in order. Thanks again for the input.

ADD REPLY • link 5.1 years ago by Hamish ▴ 40