Biostar Beta. Not for public use.
Question: Demultiplexing SRA data

Hello,

I am looking at this experiment

Looks like the authors uploaded a multiplexed dataset with no info on barcoding. Any way to guess the barcodes? Just by running fastqc you can sorta guess what some barcodes may be, but is there a better way?

I downloaded the SRA file and converted it with --split-3, which gave me two files for the paired-end reads.

Here is what the file sorta looks like:

@SRR2046220.999996 HISEQ:108:C24D5ACXX:4:1101:5712:37823 length=102
AAGGATGCAATTCCTGGTGGTGCCATGGAGGTAAAGTCATAGTATTTTTATGATTTATATTTACATATTTTTACACTTCATAGTCATTTTTATAAAACTTTN
+SRR2046220.999996 HISEQ:108:C24D5ACXX:4:1101:5712:37823 length=102
CCCFFFFFHHHHHJJJIFGGFHIIHIIJAHIEGHJHFGGFIIBGIIJJJJJIJJIJJJJIJJJJJJIJIIJJJIIHHIIJIHHHFHHEFFFFFFEDDECEE#
@SRR2046220.999997 HISEQ:108:C24D5ACXX:4:1101:5536:37823 length=102
CAGGACGAAAATGAAGGTTTGGTTTTAACATTTGATCTGAGTTTATAGTATAGAAAGAGATCTATATTGACTCAGCTTTGCATATAAATCATACATTCTAGN
+SRR2046220.999997 HISEQ:108:C24D5ACXX:4:1101:5536:37823 length=102
######################################################################################################
@SRR2046220.999998 HISEQ:108:C24D5ACXX:4:1101:5653:37824 length=102
TAACTCTCTATTCACGAAAATCTGATCAATTGGATGACGGCTCGAAGAGCTTGATTCTACCAGATAGTACAGTTACATCAGGATGAAGTGCAGAAACGCTTN
+SRR2046220.999998 HISEQ:108:C24D5ACXX:4:1101:5653:37824 length=102
0;8@##################################################################################################
@SRR2046220.999999 HISEQ:108:C24D5ACXX:4:1101:5739:37825 length=102
CTCAATTCAATTCGGAGCTTCGTCCCCTACAGGACCTCACCCTTCGATCAAACTAAATTATTATTCTTTTTCCAATATTACAATATCAACAATATGTACGTN
+SRR2046220.999999 HISEQ:108:C24D5ACXX:4:1101:5739:37825 length=102
1++44=BDF>FFFEEG1CFCGIDGHFH@G>FGEHDGGIIIIID;;FFHEG8@FG@AGHH;CAEFFEHHHEEEDDE>CCEDCA5>>CDEC?@?CCCC>@CB<#
@SRR2046220.1000000 HISEQ:108:C24D5ACXX:4:1101:5569:37827 length=102
TCTCTTACAATTCCAAAAGATATAGATAAGGCAATTTATTGGTATGAAGAATCTGCTAAACAAGGAAATCAAGGTGCACAAAATAGTTTAGAAGGACTTCAN
+SRR2046220.1000000 HISEQ:108:C24D5ACXX:4:1101:5569:37827 length=102
+++22?@A+?CCCCBBBCBCABBBBBCBBBBCCBBBBBBBBBABABBBBBBBBBBBBBBBBBBBBBABBBBABB>=ABAAAA<>>@?@@B>@@@@==;???#
4.0 years ago Adrian Pelin ♦ 2.3k • updated 17 months ago RamRS 21k

Based on the SRA entry this is ddRAD-Seq data. ENA also has just paired end fastq files. This must be EcoRI-MspI digested fragments.

4.0 years ago genomax 68k • updated 17 months ago RamRS 21k

How to detect if a certain SRA RNA-seq fastq file has been demultiplexed or not?

9 months ago Arindam Ghosh • 160

Check the index sequences in the fastq headers. If there is more than one, the file is likely not demultiplexed.
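
As a rough sketch of that check: the hypothetical helper `index_counts` below tallies the last colon-separated field of each header line. It assumes an uncompressed FASTQ with four lines per record and headers in the Illumina CASAVA 1.8+ layout, where the index is that last field (e.g. `... 1:N:0:ACGTAC`). Headers with no base-like index in that position, such as the SRR2046220 headers quoted in the question, are counted as `no_index`. It uses plain awk rather than bioawk.

```shell
# index_counts FASTQ
# Tally the index field found at the end of each header line.
# Assumption: the index is the last colon-separated field of the header.
index_counts() {
    awk 'NR % 4 == 1 {
             n = split($0, parts, ":")
             idx = parts[n]
             # Accept single (ACGTAC) or dual (ACGTAC+TTGGCC) indexes only.
             if (idx ~ /^[ACGTN+]+$/) print idx
             else print "no_index"
         }' "$1" | sort | uniq -c | sort -rn
}
```

Running `index_counts reads_1.fastq | head` (the file name is a placeholder) and seeing several different index sequences with substantial counts would suggest the file is still multiplexed.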

9 months ago genomax 68k

I have a semi-empirical "method": cut the first N bases from the first 100,000 sequences, then see how many unique prefixes you get. Something like this (untested, just typed in):

# 4 lines per FASTQ record, so 400000 lines = 100,000 reads
head -n 400000 data.fq |\
    bioawk -c fastx '{ print substr($seq, 1, 8) }' | sort |\
    uniq -c | sort -rn

Then keep raising the 8 to 9, 10, etc. As long as you are still reading the indexed adapter, you'll have clear groups with approximately the same number of sequences.
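
The same idea can be wrapped in a small function so that the prefix length is easy to vary. This is only a sketch: `prefix_counts` and its arguments are names made up here, it uses plain awk instead of bioawk, and it assumes an uncompressed FASTQ with exactly four lines per record.

```shell
# prefix_counts FASTQ LEN [NREADS]
# Tally the first LEN bases of the first NREADS reads (default 100000).
prefix_counts() {
    fastq=$1; len=$2; nreads=${3:-100000}
    # Four lines per record: the sequence is line 2 of each group of 4.
    head -n "$((nreads * 4))" "$fastq" |
        awk -v n="$len" 'NR % 4 == 2 { print substr($0, 1, n) }' |
        sort | uniq -c | sort -rn
}
```

For example, `for n in 6 8 10 12; do echo "== $n"; prefix_counts data.fq "$n" | head -n 10; done` prints the ten most common prefixes at each length, which makes it easier to see the length at which the groups start to break apart.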

4.0 years ago Istvan Albert 80k • updated 17 months ago RamRS 21k

Very very cool idea, love it! I tried on 1 million

So, at 8 got:

1490 TCAATATC
1376 TAAAAAAA
1351 GGCATATC
1144 TGATCGCC
1063 GGACTCAC
1055 GGATATAC
1026 GGCAAGGC
1006 GGCCATCC

Low complexity like TAAAAAAA is probably a false one.

At 10:

1465 TCAATATCAA
1347 GGCATATCAA
1130 TGATCGCCAA
1039 GGATATACAA
 996 GGCAAGGCAA
 984 GGCCATCCAA
 965 GGACTCACAA
 907 GAACTTGCAA

Interesting that all end in AA.

At 15:

457 TAANNNNNNNNNNNN
187 TCAATATCAATTCAA
182 TCAATATCAATTCTT
166 GGCATATCAATTCCT
157 TCAATATCAATTCAT
149 TGATCGCCAATTCAA
143 GGCATATCAATTCAA
142 GGCAAGGCAATTCAA
134 GGCCATCCAATTCAA
133 GGATATACAATTCAA
128 GGATATACAATTCTT

Is everything before AA the index sequence?
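
One hedged way to probe that question is to check where a shared motif starts in each read. In the 15-base listing above, `AATTC` follows the first 8 bases in every sequence shown except the N-filled one, which would be consistent with an EcoRI cut site sitting right after an 8-base inline barcode. The sketch below tallies the 1-based offset of the first occurrence of a motif, with 0 meaning the motif is absent. The helper name `motif_offsets` is hypothetical, and an uncompressed four-line FASTQ is assumed.

```shell
# motif_offsets FASTQ MOTIF [NREADS]
# Tally the 1-based position at which MOTIF first appears in each of the
# first NREADS reads (default 100000); offset 0 means "not found".
motif_offsets() {
    fastq=$1; motif=$2; nreads=${3:-100000}
    head -n "$((nreads * 4))" "$fastq" |
        awk -v m="$motif" 'NR % 4 == 2 { print index($0, m) }' |
        sort -n | uniq -c | sort -rn
}
```

If `motif_offsets data.fq AATTC` shows most reads at offset 9, the bases before the motif have a fixed length of 8. A spread of offsets would instead point to variable-length barcodes or to no barcode at all.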

4.0 years ago Adrian Pelin ♦ 2.3k • updated 17 months ago RamRS 21k

Normally there should be a big drop in the groupings. Make sure to run it on the first file, though: only that one will have the index; the other file of the pair won't.

4.0 years ago Istvan Albert 80k

Yeah I am using it on _1 from sra. Will try on _2. A big drop in grouping would make sense, but I don't think I see it. It may be that the file was demultiplexed, barcodes removed, and then all groups merged back together? That hardly makes sense but maybe possible.

4.0 years ago Adrian Pelin ♦ 2.3k

yeah, your data does not seem to indicate the presence of common sequences at the start.

4.0 years ago Istvan Albert 80k
