bcl2fastq2: how to correctly use the --use-bases-mask for different sequencing methods by Illumina ?
13 months ago
badredda • 100
France/Lille/EGID

Hello,

I need help with the --use-bases-mask parameter of bcl2fastq2 when demultiplexing data from Illumina sequencers. Sequencing runs are mostly either Paired-End (PE) or Single-End (SE), and the libraries carry either a single index or a dual index. As per Illumina's definition:

Single and Dual Indexing

The number of index sequences added to samples differs for single-indexed and dual-indexed sequencing.

Single-indexed libraries — Adds up to 48 unique six-base Index 1 (i7) sequences to generate up to 48 uniquely tagged libraries.

Dual-indexed libraries — Adds up to 24 unique eight-base Index 1 (i7) sequences and up to 16 unique eight-base Index 2 (i5) sequences, generating up to 384 uniquely tagged libraries. The IDT for Illumina TruSeq UD Indexes are provided as index pairs and can generate up to 96 uniquely tagged libraries. These indexes add up to 96 unique eight-base Index 1 sequences and up to 96 unique eight-base Index 2 indexes.

During indexed sequencing, the index is sequenced in a separate read, called the Index Read, where a new sequencing primer is annealed. When libraries are dual-indexed, the sequencing run includes two additional reads, called the Index 1 Read and Index 2 Read.

Knowing this, I have two questions:

  1. Is it acceptable to mix single-index and dual-index libraries on the same flowcell (e.g. HiSeq 4000) when the sequencer was configured for a dual-index run?
  2. How can we demultiplex such data, given that the file generated by the sequencer (RunInfo.xml) describes a dual-index run? In other words, demultiplexing the dual-index lanes works fine with the RunInfo.xml as is, but what should I pass to --use-bases-mask for the single-index lanes?

Also, I know that for --use-bases-mask, we can use the following parameters for different types of sequencing:

  • Single-End sequencing: Y*,I6N*
  • Paired-End sequencing:

    • Dual-Indexing: Y*,I*,I*,Y*
    • No Index: Y*,Y* (Thanks to Devon Ryan)
    • Single Indexing: Y*,I6N,Y* (Thanks to Devon Ryan)
    • In-read barcode in the first read for some of the samples, but the run was PE dual-index: I5Y*,N*,N*,Y* (Thanks to igor)
    • 10x Genomics Single Cell 3' v1 kit: Y98,Y14,I8,Y10 (Thanks to igor)
    • 10x Genomics Single Cell 3' v1 kit + more standard libraries on the same run: Y98N*,Y14N*,I8N*,Y10N* (Thanks to igor)

    Also, could you please state what other types of parameters could be used in different cases ? (for future readers)

Thanks for your time and help. Don't forget to upvote this post please so users can find this post.

ADD COMMENTlink
2
Entering edit mode
12 months ago
Freiburg, Germany
  1. Yes, though bcl2fastq2 won't be able to handle it in a single step. We commonly do this and we then process each flow cell in compatible chunks, using --tiles. As an example, if the first two lanes of a flow cell have compatible indices (both in number and length) then you need --tiles s_1,s_2. You then also need multiple output directories per flow cell.
  2. See above. In short, you use one --use-bases-mask at a time.

Note that unless you have a mixture of either barcode lengths between lanes or barcode strategies (dual vs. single) you don't actually need --use-bases-mask at all.

For PE with no index you could use --use-bases-mask Y*,Y*, unless the run included an index read. For a single index it'd then be Y*,I6N,Y*.
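
For future readers, the "compatible chunks" approach can be scripted. Below is a minimal sketch in Python that only builds the command lines, one per chunk, and prints them. The run folder, output directories, sample sheet names, and the lane grouping are all placeholders I made up for illustration; only the bcl2fastq2 options themselves (--runfolder-dir, --output-dir, --sample-sheet, --tiles, --use-bases-mask) are real.

```python
import shlex

def bcl2fastq_command(run_dir, out_dir, sample_sheet, lanes, bases_mask=None):
    """Build one bcl2fastq2 invocation restricted to the given lanes."""
    # --tiles takes a comma-separated list; s_<lane> selects a whole lane.
    tiles = ",".join(f"s_{lane}" for lane in lanes)
    cmd = [
        "bcl2fastq",
        "--runfolder-dir", run_dir,
        "--output-dir", out_dir,
        "--sample-sheet", sample_sheet,
        "--tiles", tiles,
    ]
    # Only pass a mask when the chunk differs from what RunInfo.xml implies.
    if bases_mask is not None:
        cmd += ["--use-bases-mask", bases_mask]
    return cmd

# Hypothetical layout: lanes 5 and 6 are single-index, the rest dual-index.
chunks = [
    {"name": "dual", "lanes": [1, 2, 3, 4, 7, 8], "mask": None},
    {"name": "single", "lanes": [5, 6], "mask": "Y*,I6nn,nnnnnnnn,Y*"},
]

for chunk in chunks:
    cmd = bcl2fastq_command(
        run_dir="/path/to/runfolder",                     # placeholder
        out_dir=f"fastq_{chunk['name']}",                 # one output dir per chunk
        sample_sheet=f"SampleSheet_{chunk['name']}.csv",  # placeholder
        lanes=chunk["lanes"],
        bases_mask=chunk["mask"],
    )
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each chunk gets its own sample sheet and output directory, which matches the "multiple output directories per flow cell" point above.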

ADD COMMENTlink
0
Entering edit mode

Dear Ryan,

Thanks for your reply. The single index is 6 bases long while the dual indexes are 8 bases each, and all indexes are different from one another. Let's take this RunInfo.xml as an example (uploaded to my Google Drive):

https://drive.google.com/open?id=1EJHnNuTyW8BfDLdE4yoBxp78rw8bYsHF

How can I proceed, knowing that for example, lane 5 and 6 are the single index data ?

Thanks

ADD REPLYlink
1
Entering edit mode

--use-bases-mask Y*,I6nn,nnnnnnnn,Y* in that case.

ADD REPLYlink
1
Entering edit mode

badredda you could use a separate --use-bases-mask for lanes 5 and 6 and then a different one for other lanes.

ADD REPLYlink
1
Entering edit mode

I'm running into a similar problem. Could you help me?

my RunInfo.xml:

<?xml version="1.0"?>
<RunInfo xmlns:xsd="..." xmlns:xsi="..." Version="4">
  <Run Id="190219_NB500954_0035_AHGMJVAFXY" Number="35">
    <Flowcell>HGMJVAFXY</Flowcell>
    <Instrument>NB500954</Instrument>
    <Date>190219</Date>
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="4" NumCycles="151" IsIndexedRead="N" />
    </Reads>
    <FlowcellLayout LaneCount="4" SurfaceCount="2" SwathCount="1" TileCount="12" SectionPerLane="3" LanePerSection="2">
      <TileSet TileNamingConvention="FiveDigit">
        <Tiles>
          <Tile>1_11101</Tile>
          <Tile>1_21101</Tile>
          <Tile>1_11102</Tile>
          ...
          <Tile>4_11612</Tile>
          <Tile>4_21612</Tile>
        </Tiles>
      </TileSet>
    </FlowcellLayout>
    <ImageDimensions Width="2592" Height="1944" />
    <ImageChannels>
      <Name>Red</Name>
      <Name>Green</Name>
    </ImageChannels>
  </Run>
</RunInfo>

We normally run a 151x8x8x151 amplicon panel, but this time we added a single-indexed panel with a 12-base index. I tried --use-bases-mask Y*,I12,,Y* but got the error below:

2019-02-25 21:45:12 [7faca61f4780] ERROR: bcl2fastq::common::Exception: 2019-Feb-25 21:45:12: Success (0): /tmp/bcl2fastq/bcl2fastq/src/cxx/lib/layout/Layout.cpp(378): Throw in function void bcl2fastq::layout::setIndexReadMetadata(const std::vector<long unsigned int>&, bcl2fastq::layout::ReadMetadata&, size_t)
Dynamic exception type: boost::exception_detail::clone_impl<bcl2fastq::common::InputDataError>
std::exception::what: Barcodes in sample sheet are longer than the index length found in RunInfo.xml.

I also tried changing the index read lengths in RunInfo.xml to 12:

<Read Number="2" NumCycles="12" IsIndexedRead="Y" />
<Read Number="3" NumCycles="12" IsIndexedRead="Y" />

But my FASTQ files came out empty. Any help?

ADD REPLYlink
4
Entering edit mode

If you only ran 8 bases for the first index, that's all you've got. You can't invent data you don't have by futzing with the command line.

ADD REPLYlink
1
Entering edit mode

Was the single-indexed panel the only one on the flow cell, or was it mixed with normal-length indices? Was it actually 12 bases, or did you dual index it with 6-base indices? If the former, then only the first 8 bases of the barcode were actually read and those reads are going to end up in the undetermined indices no matter what you do. You can write a bit of Python to retrieve them then.
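
A minimal sketch of that "bit of Python", assuming the usual bcl2fastq-style FASTQ header where the index is the last colon-separated field (i7+i5 for dual-index runs). The function names and file names are mine, and matching is exact on the i7 prefix with no mismatch tolerance. Since only 8 index cycles were run here, you would pass the first 8 bases of the 12-base barcode.

```python
import gzip

def index_of(header):
    """Return the index field of a bcl2fastq-style FASTQ header line."""
    # Header: @instrument:run:flowcell:lane:tile:x:y read:filtered:control:INDEX
    return header.rstrip().rsplit(":", 1)[-1]

def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")

def extract_by_index(fastq_in, fastq_out, i7_prefix):
    """Copy records whose i7 index starts with i7_prefix; return how many."""
    kept = 0
    with open_fastq(fastq_in) as src, open(fastq_out, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]  # one 4-line record
            if not record[0]:
                break  # end of file
            i7 = index_of(record[0]).split("+")[0]
            if i7.startswith(i7_prefix):
                dst.writelines(record)
                kept += 1
    return kept

# Hypothetical usage (file names and barcode are placeholders):
# barcode = "ACGTACGTACGT"   # 12-base index from the kit
# extract_by_index("Undetermined_S0_R1_001.fastq.gz", "panel_R1.fastq", barcode[:8])
```

For paired-end data you would run the same filter on the R2 file, since both files carry the same index in their headers.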

ADD REPLYlink
0
Entering edit mode

Thank you guys! It was mixed with normal-length indices (8 bases), and it was 12 bases on one side. So the Python script should open the Undetermined FASTQ and search the read headers for the possible index?

ADD REPLYlink
1
Entering edit mode

As @swbarnes2 pointed out above, your RunInfo.xml shows that this run was set up as 151x8x8x151:

<Reads>
  <Read Number="1" NumCycles="151" IsIndexedRead="N" />
  <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
  <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
  <Read Number="4" NumCycles="151" IsIndexedRead="N" />
</Reads>

i.e. with 8 cycles on Index 1 and 8 cycles on Index 2. There is NO way to recover data for 12 cycles of Index 1, since those additional 4 cycles were never sequenced.

If the 8 bp of Index 1 that were sequenced are enough to tell the samples apart, you may be able to recover the data. Otherwise this run will have to be repeated for the samples with 12 bp indexes.
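
To check this before running bcl2fastq, here is a small sketch that reads the read structure out of RunInfo.xml and trims a barcode to what was actually sequenced. The function names and the 12-base barcode are made up for illustration; the XML snippet is cut down to the Reads block from the file above.

```python
import xml.etree.ElementTree as ET

RUNINFO = """<?xml version="1.0"?>
<RunInfo Version="4">
  <Run Id="190219_NB500954_0035_AHGMJVAFXY" Number="35">
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="4" NumCycles="151" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>"""

def read_structure(runinfo_xml):
    """Return [(num_cycles, is_index_read), ...] in run order."""
    root = ET.fromstring(runinfo_xml)
    return [
        (int(read.get("NumCycles")), read.get("IsIndexedRead") == "Y")
        for read in root.iter("Read")
    ]

def usable_barcode(structure, barcode, index_read=1):
    """Trim a barcode to the number of cycles sequenced for that index read."""
    index_cycles = [cycles for cycles, is_index in structure if is_index]
    return barcode[: index_cycles[index_read - 1]]

structure = read_structure(RUNINFO)
print(structure)

barcode = "ACGTACGTACGT"  # hypothetical 12-base i7 barcode
trimmed = usable_barcode(structure, barcode)
if len(trimmed) < len(barcode):
    print(f"Only {len(trimmed)} of {len(barcode)} bases were sequenced: {trimmed}")
```

The trimmed barcodes are what would go in the sample sheet, provided they are still unique across the samples in the lane.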

ADD REPLYlink
0
Entering edit mode

Thanks, genomax. The 8 bp from Index 1 were specific enough to recover them, so I just adjusted the sample sheet accordingly. Best regards.

ADD REPLYlink
2
Entering edit mode
22 months ago
igor 7.7k
United States

could you please state what other types of parameters could be used in different cases ?

Any parameters are possible. The parameter specifies how you want to interpret the actual sequencing output. You have to make sure that the number of reads and their lengths match what was run.

You can use different --use-bases-mask for different lanes or just provide a sample sheet for the lanes that you are interested in.

There are many odd library options. For a hypothetical example, you may have an in-read barcode in the first read for some of the samples, but the run was PE dual-index. Then you might have: I5Y*,N*,N*,Y* (treat first 5 bases of R1 as index and the rest as actual read, ignore I1 and I2, then treat R2 as normal read).

For a real-life example, the 10x Genomics Single Cell 3' v1 kit required this: Y98,Y14,I8,Y10. This used the second index read as the bcl2fastq index, but kept the other reads for additional processing with more specialized software (Cell Ranger). If you had other more standard libraries on the same run, you would need to add Ns to ignore additional bases: Y98N*,Y14N*,I8N*,Y10N*.
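
If it helps to see what a mask does cycle by cycle, here is a rough sketch that expands a mask against the cycles per read. It is my own simplified reading of the mask syntax (Y = use as read, I = use as index, N = ignore, a number repeats the letter, * fills the rest of the read), not bcl2fastq's actual parser, and the cycle counts in the second example are hypothetical.

```python
import re

def expand_mask(mask, cycles_per_read):
    """Expand a --use-bases-mask string into one letter per cycle, per read."""
    parts = mask.split(",")
    if len(parts) != len(cycles_per_read):
        raise ValueError("mask must have one entry per read in the run")
    expanded = []
    for part, cycles in zip(parts, cycles_per_read):
        if not re.fullmatch(r"(?:[YyIiNn](?:\d+|\*)?)+", part):
            raise ValueError(f"cannot parse mask entry {part!r}")
        out = ""
        for letter, count in re.findall(r"([YyIiNn])(\d+|\*)?", part):
            if count == "*":
                n = cycles - len(out)  # fill whatever is left of this read
            elif count:
                n = int(count)
            else:
                n = 1                  # bare letter covers a single cycle
            out += letter.upper() * n
        if len(out) != cycles:
            raise ValueError(f"{part!r} covers {len(out)} cycles, read has {cycles}")
        expanded.append(out)
    return expanded

# 151x8x8x151 run, 6-base single index: index read 2 is ignored entirely.
print(expand_mask("Y*,I6nn,nnnnnnnn,Y*", [151, 8, 8, 151])[1:3])

# 10x-style mask on a hypothetical longer run (150x14x8x150).
for read in expand_mask("Y98N*,Y14N*,I8N*,Y10N*", [150, 14, 8, 150]):
    print(len(read), read.count("Y"), read.count("I"), read.count("N"))
```

The sketch raises an error when a mask entry does not cover exactly the cycles of its read, which is the "number of reads and their lengths" check described above.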

ADD COMMENTlink
2
Entering edit mode
17 months ago
swbarnes2 5.7k
United States

Is it acceptable to mix single-index and dual-index libraries on the same flowcell (e.g. HiSeq 4000) when the sequencer was configured for a dual-index run?

Yes. I do this all the time, without messing with base masking or subsetting by lane/tile.

Did you try it the easy way first?

ADD COMMENTlink
0
Entering edit mode

Does that work now? It used to break bcl2fastq2.

ADD REPLYlink
0
Entering edit mode

I frequently have a mix of samples on one flow cell, some with two indices, some with one. I used to break up into two sample sheets, but I don't now, and it works fine. I can't remember testing having indices of differing lengths, but I think that will work too.

ADD REPLYlink
0
Entering edit mode

Interesting, I wonder when Illumina enabled this, it would seriously simplify my demultiplexing workflow :)

ADD REPLYlink
