Biostar Beta. Not for public use.
How to see how many times a sequence of ~7000 bp occurs in long reads?
1
Entering edit mode
19 months ago
bas1993 • 10

So I have pacbio and MinION reads and I want to know if a genetically modified region is 1 time present or multiple times. Normal mapping doesn't give me the answers that I want as the reads are soft-clipped on the known sequence so I have no information about what else is in the read. So instead of using the known sequence as reference and the reads as query I was thinking about using the reads in a multi-fasta file as a reference and use the known sequence as query. Which tool is the best for this? a local BLAST? Or am I thinking in a wrong way.

1
Entering edit mode

minimap2 should be the aligner of choice in this case. Is that what you already used?

You could do it with BLAST but depending on what kind of local alignments you see (if there are smaller repeat regions) you may have to play with BLAST settings to get the result you need. If you expect the 7kb sequence (or parts of it) to be present as is in the reads then even blat may be useful to find them.

0
Entering edit mode

But will I not get the same problem as with other mappers. I know the genetic modification is present in the reads, but I would like to know if it is present multiple times (or 1 time completely and then half of the sequence).

0
Entering edit mode

Do you not need to assemble first to figure this out? How will you (easily) detect multiple genetically modified versions when you will already get many, many reads mapping?

0
Entering edit mode

I already did an assembly and in the .gfa files it looked like circular contig that is connected to the chromosome. So I hypothised that maybe the assembly was influenced by some of the reads that are shorter than the modification. That is why I extracted all the reads that are longer than the modification and wanted to some kind of alignment.

0
Entering edit mode

I did not do this with a 7 kb sequence but with a test sequence (there are two copies in second read) I see the following.

$more query.fa >test TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATT$ less -S new.fq
TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATTCTCAGGCATTGAAAAAATCTTTAAAAAAATCTTTAAAAAAATGACACGCACGCAATAAATAGGGGAAAATCTTCCCCCCCAATTCGGGCACTAAAAGGAATTTTTTTTCCTTCCCTGTCATCTAGAGGAATTTA
+
$%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&))).$'-01/9118;;<2.02<2135;<<-+,0<:059:;73,./301(*&)$&/3149;2))24*,37/)-467259;6)/.,.*'/-.76/7;6'-(*++29;<<6-7751:7(+89967637**+...4<.+
TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATTCTCAGGCATTGAAAAAATCTTTAAAAAAATCTTTAAAAAAATGACACGCACGCAATAAATAGGGGAAAATCTTCCCCCCCAATTCGGGCACTAAAAGGAATTTTTTTTCCTTCCCTGTCATCTAGAGGAATTTATAGAGATCGGTGGCTGGTTCAGTTACGTA
+
$%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&))).$'-01/9118;;<2.02<2135;<<-+,0<:059:;73,./301(*&)$&/3149;2))24*,37/)-467259;6)/.,.*'/-.76/7;6'-(*++29;<<6-7751:7(+89967637**+...4<.+$%''&'''&''('%',3.32./6/%$%('


Using the following, I see the two copies of query sequence

$minimap2 -ax map-ont query.fa new.fq @SQ SN:test LN:77 @PG ID:minimap2 PN:minimap2 VN:2.15-r905 CL:minimap2 -ax map-ont query.fa new.fq a0ab7a02-b816-4921-ae1a-61fe262cb994 0 test 1 53 77M134S * 0 0 TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATTCTCAGGCATTGAAAAAATCTTTAAAAAAATCTTTAAAAAAATGACACGCACGCAATAAATAGGGGAAAATCTTCCCCCCCAATTCGGGCACTAAAAGGAATTTTTTTTCCTTCCCTGTCATCTAGAGGAATTTA$%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&))).$'-01/9118;;<2.02<2135;<<-+,0<:059:;73,./301(*&)$&/3149;2))24*,37/)-467259;6)/.,.*'/-.76/7;6'-(*++29;<<6-7751:7(+89967637**+...4<.+ NM:i:0 ms:i:154 AS:i:154 nn:i:0 tp:A:P cm:i:10 s1:i:71 s2:i:0 de:f:0 a0ab7a02-b816-4921-bc1a-61fe262cb994 0 test 1 53 211S77M7S * 0 0 TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATTCTCAGGCATTGAAAAAATCTTTAAAAAAATCTTTAAAAAAATGACACGCACGCAATAAATAGGGGAAAATCTTCCCCCCCAATTCGGGCACTAAAAGGAATTTTTTTTCCTTCCCTGTCATCTAGAGGAATTTATAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATTCTCAGGC$%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&))).$'-01/9118;;<2.02<2135;<<-+,0<:059:;73,./301(*&)$&/3149;2))24*,37/)-467259;6)/.,.*'/-.76/7;6'-(*++29;<<6-7751:7(+89967637**+...4<.+$%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&))).$'-0 NM:i:0  ms:i:154        AS:i:154        nn:i:0  tp:A:P  cm:i:10s1:i:71  s2:i:0  de:f:0  SA:Z:test,1,+,77M218S,53,0;
a0ab7a02-b816-4921-bc1a-61fe262cb994    2048    test    1       53      77M218H *       0       0       TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATT   $%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&)   NM:i:0  ms:i:154        AS:i:154        nn:i:0  tp:A:P  cm:i:10 s1:i:71 s2:i:0  de:f:0  SA:Z:test,1,+,211S77M7S,53,0;

0
Entering edit mode

So you essentially knocked in a certain sequence which is not part of the genome, you don't know its location and want to know how many integration locations you have?