Demultiplexing Illumina data
dbro970 ▴ 60 · 6.0 years ago

Hi, I'm fairly new to bioinformatics but have some Illumina sequence data in the form of two read files and a separate barcode file. I'm trying to demultiplex these data but can't find a way to do it. Can anyone give me some advice?

Read File 1:

@HWI-M01156:58:000000000-A14TV:1:1101:15941:1367 1:N:0:
TACGTAGGGTGCGGGCGTTAATCGGAATAACTGGGCGTAAAGGGCACGCAGGCGGTTATTTAAGTGAGGTGTGAAATCCCCGGGCTTAACCTGGGAATTGCATTTCTGACTGGGTAACTTGAGTACTTTTGGGGGGGGTAGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGGGGAATACCGAAGGCGAAGGCAGCCCCTTGGGATTGTACTGACGCCCTTGTGGGAAAGGGGGGGGGGCAAACG

Read File 2:

@HWI-M01156:58:000000000-A14TV:1:1101:15941:1367 3:N:0:
CCTTTTTTCTCCCCACTCTTTCGCTCTTTTTCTTCTTTTCTTTCCCTTTTTTTTTCCTTCGCCTTCTTTTTTCCTCCTCATCTCTTCGCTTTTCACCGCTNNNCNNNTNNTTCTTCCCCTCTCTTACTTACTCTCTTTTCCCATTCTCCATTTCATTTCCTTGTTTTTCCCCGTTCTTTTCCCTCTTTCCTTTATTTCCCTCCTCCTTCCCCTTTCCCCCCTTTTTTTCCTTTTTTCCTCTCCCCCTTCCT

Barcodes:

@HWI-M01156:58:000000000-A14TV:1:1101:15941:1367 2:N:0:
TTCCGTAGGGTT

Thanks Heaps in advance!

Tags: demultiplex • next-gen

See this thread: demultiplex a dataset when you have barcodes as a separate fastq

If it is feasible, ask the sequence provider to demultiplex the data themselves instead of delivering the index read as a separate file (that is the standard way of demultiplexing data).

Gabriel R. ★ 2.9k · 6.0 years ago

There are different programs to do this, but you can try our own deML, which does maximum-likelihood demultiplexing:

    deML -i index.txt -f todemultiplex.fq1.gz -r todemultiplex.fq2.gz -if1 todemultiplex.i1.gz -o demultiplexed_
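
For reference, index.txt is a tab-separated file with one line per sample, containing the index sequence(s) and a sample name (2 fields for a single index, 3 for dual indices; check the deML README for the exact column order). A minimal made-up example, using the barcode from the question plus a placeholder second index and invented sample names:

    TTCCGTAGGGTT	sample1
    ACGTACGTACGT	sample2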

Thank you for that. Good solution to keep on file.

What different programs are you referring to for this specific case (where the index sequences are in a separate file)?


I think Bayexer can do this as well; they compared it to deML in their paper. I am not sure about the input format, but it is another third-party demultiplexer.


I noticed that your solution requires a file with index sequences. It may be cumbersome if one is dealing with molecular tags etc. Can deML be made to work without the index file?


I assume you are referring to the files containing the per-cluster index. At present, you would need to dump those into a file or file descriptor, but I could easily add an option for this if someone requests it.


As in this case, the OP may or may not have a list of all index sequences. As you said, the index list could be extracted from the I* file, but making the program independent of that requirement might make it friendlier to use.


I am slightly confused: are you referring to the per-cluster index sequence or to the list of all potential index sequences?


Potential.

I assume -i index.txt refers to a list of known indexes?


Correct. What deML does is compute, for each sequence/cluster, the probability that it pertains to a given read group. But it needs a list of potential samples with their associated indices.
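
Conceptually (a simplified sketch, not necessarily the exact model in the deML paper): for an observed index read o of length L, with per-base error probabilities derived from the quality scores (\epsilon_i = 10^{-Q_i/10}), the likelihood that it came from a candidate sample index s is

    P(o \mid s) = \prod_{i=1}^{L} \Big[ (1-\epsilon_i)\,\mathbf{1}\{o_i = s_i\} + \tfrac{\epsilon_i}{3}\,\mathbf{1}\{o_i \neq s_i\} \Big]

and each cluster is assigned to the sample whose index maximizes this likelihood, with low-confidence or ambiguous assignments left unresolved (see the deML paper/README for the exact criteria).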


Hi, thanks everyone for your answers. Unfortunately these are old data that we got from a collaborator, so I don't think the demultiplexed data are available, and I don't have an index file. I'd be interested in hearing more about how I could extract an index file from the files I've already got, though. Thanks everyone!


I mean, you could get the frequency of the different sequences in the index file, but how will you know which sample they pertain to?


You can extract unique indexes from the I* file by doing this:

 zcat File_I1_001.fastq.gz | grep -B 1 "+$" --no-group-separator | grep -v "+$" | sort | uniq > index.txt

As @Gabriel said, you would need to know the mapping information to make use of the demultiplexed data.

Sample 1 <--> Index 1
Sample 2 <--> Index 2

etc.
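
If it also helps to see how often each index occurs before building index.txt (real barcodes should dominate, while sequencing errors should be rare), here is a variant of the command above; it assumes standard 4-line FASTQ records, and the output file name is just an example:

    # keep line 2 of every 4-line record (the index sequence), then tally each distinct index
    zcat File_I1_001.fastq.gz | awk 'NR % 4 == 2' | sort | uniq -c | sort -rn > index_counts.txt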


Hi, so I have mapping information; unfortunately, however, it only connects the sample identifier to the relevant metadata and doesn't contain any information such as the barcode/linker sequences. I also only have some of the identifiers in this file: while I have the raw read data from several sequencing runs, only some of the samples on these runs are relevant for my analysis (the remaining identifiers are withheld for data confidentiality reasons).

@genomax, thanks a bunch for the script. I've been running the code like this:

zcat s_1_1_sequence.fastq.tar.gz | grep -B 1 "+$" --no-group-seperator | grep -v "+$" | sort | uniq > index.txt

Unfortunately, I keep getting this error:

grep: unrecognized option '--no-group-seperator'
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.

Thanks everyone for all of your help!


Try this command instead, then. Make sure you are running it on the file containing the index sequences (the short reads); that appears to be file 2. Your file also appears to be tarred and gzipped, so you may be best off uncompressing it first. You could pipe these things, but let us do it the dumb way:

tar -zxvf s_1_1_sequence.fastq.tar.gz

cat s_1_1_sequence.fastq | grep -B 1 "+$" | grep -v "+$" | grep -v "\-\-"| sort | uniq > index.txt
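
If you would rather not write the uncompressed file to disk, the same thing can be done in one pipe (a sketch assuming GNU tar, whose -O option extracts the archive member to stdout):

    # extract to stdout and pull out the unique sequence lines directly
    tar -xzOf s_1_1_sequence.fastq.tar.gz | grep -B 1 "+$" | grep -v "+$" | grep -v "\-\-" | sort | uniq > index.txt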

Thanks for all your help genomax, I've tried running that code. I'm curious how you could tell that file 2 is the one containing the index sequences; I'm not sure myself, so I ran it on all of the files just to check the output. It seems to be isolating the sequences but not then associating them with the relevant sample ID (see below for the first lines from each file). Can you think of a reason this might be occurring? Thanks for all of your help with this so far, you're a legend!

Code run on first file (read file 1)

AAAAAAAAAAAACAGGAGAAGGAAAGCGAGGGTATCCTACAAAGTCCAGCGTACCATAAACGCAAGCCTCAACGAAGCGACGAGCAGGAGAGCGGTCAGGAGAAATGCAAACGAGGTGACCCGGCAGAAAATCGAAATAACCGTCGGTTAAATCCAAAACGTCAGAAGCCGTAAAGAGCATAAAAGAGCCGAAAGCGGGCGGGAAACGAACGGGAGGTGCAGAAACTTGTCCCATGTGACCTAACCGACAA

Code run on the second file (read file 2)

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Code run on the barcodes file

AAAAAAAAAAAA
AAAAAAAAAAAC

In Illumina technology, the order of reads is as follows (index reads are present only when the sample is single- or dual-indexed):

Read 1 --> Index 1 --> Index 2 --> Read 2

The corresponding fastq files are named as follows (unless changed):

R1 --> I1 --> I2 --> R2  (or they could also be named R1, R2, R3, R4)

You are going to find many nonsense index sequences (there are always errors and other things one can't explain) that you will need to weed through.
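
One crude way to do that weeding, assuming you made a counts file like the index_counts.txt sketched in an earlier comment (the 1000-read cutoff is made up and would need adjusting for your run):

    # keep only indexes seen at least 1000 times and containing no N
    awk '$1 >= 1000 && $2 !~ /N/ {print $2}' index_counts.txt > candidate_indexes.txt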

You would then take the indexes you are interested in, put them in the index.txt file, and then use @Gabriel's tool.

AAAAAAAAAAAA
AAAAAAAAAAAC

As we have already said, you need to have the metadata for the sample-index association, since there is no way to recover that from this analysis.


Hi @Gabriel R.,

Thank you so much for providing the program! I am using it to try to demultiplex my MiSeq data, which contains 2 reads and 1 index (same as above). I tried to create an index file as @genomax described above, but I got this error when I ran it: Error: line AAAAA...AAC does not have 2 or 3 fields.

Could you help me understand this error and suggest how to fix it? Thank you so much!


Could you post the offending line? Put 4 spaces before the line to keep the formatting.
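
In the meantime: that error usually means the index file has only one column per line, while deML expects 2 or 3 tab-separated fields (the index sequence(s) plus a sample name). If you only have the index sequences, you could append placeholder names like this (a sketch; the generated names are made up and should really come from your sample metadata, and the column order should be checked against the README):

    # add a generic, tab-separated sample name as a second field
    awk -v OFS='\t' '{print $1, "sample" NR}' index.txt > index_named.txt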
