Biostar Beta. Not for public use.
Question: Split a sequence in a fastq file

Hi All,

Could you suggest a way to split a read in a fastq file (on a particular motif) and keep the 2 resulting sequences as 2 independent reads?

I'll give an example of what I want to do:

@K00252:388:H2LM2BBXY:3:1101:1397:1138 1:N:0:ATCACG
TGTGACCTTCAGGACAGTCCTAAGGCTGTGGGAAAAACACTNAAAACATGAGTTCAAAAATATATATATATTTTCCCAACTATGCAAAAATATAAGGATGCAATATGGATTGTATAATGAGCTTCACAGATATAAAGGAACAGNGGCAT
+
AAAAJJ77<7JJJ7FAJJJJJJJFFFJF<FFF7AFJJJJFA#JFJJFJJJJ<AA-F-<JJFJAJFAAJ<JJJJJ--<<<-FFFF7AJJJJFFJJAFFFFA<<-7<FFJA<JJJJAJF<AAFF7-F<AF-A7A-<-<J-FFJ<F#AJAA<

Then search for a sequence, e.g. TATATATATA, cut the read at that string, and keep the two resulting pieces as two reads:

@K00252:388:H2LM2BBXY:3:1101:1397:1138 1:N:0:ATCACG
TGTGACCTTCAGGACAGTCCTAAGGCTGTGGGAAAAACACTNAAAACATGAGTTCAAAAATATATATAT
+
AAAAJJ77<7JJJ7FAJJJJJJJFFFJF<FFF7AFJJJJFA#JFJJFJJJJ<AA-F-<JJFJAJFAAJ<JJJJJ

@K00252:388:H2LM2BBXY:3:1101:1397:1138 1:N:0:ATCACG
TTTTCCCAACTATGCAAAAATATAAGGATGCAATATGGATTGTATAATGAGCTTCACAGATATAAAGGAACAGNGGCAT
+
--<<<-FFFF7AJJJJFFJJAFFFFA<<-7<FFJA<JJJJAJF<AAFF7-F<AF-A7A-<-<J-FFJ<F#AJAA<

Thank you

ste.lu, 15 months ago (updated 15 months ago by Pierre Lindenbaum)

I'd suggest writing a biopython script for something like that. Do you have any programming experience?

WouterDeCoster, 15 months ago

Thanks for your answer. I've coded a bit, but my background is different. What would you suggest? A link to put me on the right track would be more than enough.

ste.lu, 15 months ago

I'd recommend going through some sections of the Biopython cookbook and tutorial. That would put you on track on how to solve this and further questions about handling common file formats.

While one-liners like Pierre's are pretty (and efficient), it would probably take me less time to write it in Python, especially if I have scripts saved from earlier, similar applications that I only have to adapt a bit.

WouterDeCoster, 15 months ago
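As a rough illustration of the Python route suggested above, here is a minimal sketch in plain Python (nothing from Biopython is needed for this part). The function name `split_read` and the toy read are made up for illustration. It assumes the motif itself should be dropped and that only the first occurrence matters; the thread does not settle either point.

```python
def split_read(header, seq, qual, motif):
    """Split one FASTQ record at the first occurrence of motif.

    Returns a list of (header, seq, qual) tuples: two records when the
    motif is found (the motif itself is dropped, an assumption), or the
    unchanged record when it is absent.
    """
    pos = seq.find(motif)  # 0-based position, -1 if the motif is absent
    if pos == -1:
        return [(header, seq, qual)]
    end = pos + len(motif)
    # Slice sequence and quality with the same coordinates so they stay in sync.
    return [
        (header, seq[:pos], qual[:pos]),
        (header, seq[end:], qual[end:]),
    ]


# Toy record (hypothetical data): 4 bases, the 10-base motif, 4 bases.
records = split_read("@read1", "ACGTTATATATATAGGCC",
                     "IIIIJJJJJJJJJJFFFF", "TATATATATA")
for header, seq, qual in records:
    print(header, seq, "+", qual, sep="\n")
```

Both pieces keep the original header here, as in the example in the question. Some downstream tools may expect unique read names, so appending a suffix to each header could be worth considering.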

Linearize, use awk to detect the position of the pattern, print the two sequences, convert back to fastq:

cat input.fastq |\
paste - - - - |\
awk -F '\t' 'BEGIN{S="TATATATATA";N=length(S);}{i=index($2,S);if(i==0) {print} else {printf("%s\t%s\t+\t%s\n%s\t%s\t+\t%s\n",$1,substr($2,1,i),substr($4,1,i),$1,substr($2,i+N),substr($4,i+N));}}' |\
tr "\t" "\n"
Pierre Lindenbaum, 15 months ago
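The "linearize, then convert back" steps of the one-liner can be mimicked in Python to see what `paste - - - -` and `tr` are doing. This is only a sketch: `fastq_records` is a hypothetical helper, and like the shell version it assumes every record occupies exactly four lines (no wrapped sequences).

```python
import io
from itertools import islice


def fastq_records(handle):
    """Yield (header, seq, plus, qual) tuples, reading four lines at a time."""
    while True:
        chunk = [line.rstrip("\n") for line in islice(handle, 4)]
        if not chunk:
            return
        if len(chunk) != 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(chunk)


# Two toy records (hypothetical data) standing in for input.fastq.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nFFFF\n")

# "Linearize": one tab-separated row per record, like `paste - - - -`.
rows = ["\t".join(record) for record in fastq_records(demo)]
print(rows)  # ['@r1\tACGT\t+\tIIII', '@r2\tGGCC\t+\tFFFF']

# "Convert back": turn tabs into newlines again, like `tr "\t" "\n"`.
restored = "\n".join(rows).replace("\t", "\n")
print(restored)
```

With one record per row, the sequence and quality are simply fields 2 and 4, which is what makes the awk step possible.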

Hi Pierre,

Thanks for your script! This way I keep all the reads, the original one and the two derived ones, right?

ste.lu, 15 months ago

no, you will only get the two substrings as output. But that's what you asked for, no?

lieven.sterck, 15 months ago

Yeah, definitely. Thanks!

ste.lu, 15 months ago

Lovely one-liner, Pierre Lindenbaum!

Some remarks though: I think the motif is missing from your output (at least that's what I understood from OP's example, where the motif is still included), and there might be an off-by-one mistake in it as well?

lieven.sterck, 15 months ago

"an off-by-one mistake in it as well?"

Maybe :-D

Pierre Lindenbaum, 15 months ago
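To make the off-by-one question concrete, this sketch re-creates awk's 1-based `index`/`substr` arithmetic in Python on a made-up read. `awk_substr` is a hypothetical helper written for this illustration. The three variants show what the posted arithmetic yields and two self-consistent alternatives; which one matches OP's intent is not settled in the thread.

```python
def awk_substr(s, start, length=None):
    """Mimic awk substr() for start >= 1: 1-based start, optional length."""
    begin = start - 1
    return s[begin:] if length is None else s[begin:begin + length]


seq = "ACGTTATATATATAGGCC"   # toy read: ACGT + motif + GGCC
motif = "TATATATATA"
n = len(motif)
i = seq.find(motif) + 1      # like awk index(): 1-based, 0 when absent

# As posted: substr($2,1,i) keeps the first base of the motif on the left.
as_posted = (awk_substr(seq, 1, i), awk_substr(seq, i + n))
# Variant A: drop the motif entirely (i-1 bases on the left).
motif_dropped = (awk_substr(seq, 1, i - 1), awk_substr(seq, i + n))
# Variant B: keep the whole motif at the end of the left read.
motif_kept_left = (awk_substr(seq, 1, i + n - 1), awk_substr(seq, i + n))

print(as_posted)        # ('ACGTT', 'GGCC')
print(motif_dropped)    # ('ACGT', 'GGCC')
print(motif_kept_left)  # ('ACGTTATATATATA', 'GGCC')
```

The same arithmetic applies to the quality string (field 4), so whichever variant is chosen has to be used for both fields to keep sequence and quality the same length.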
