Biostar Beta. Not for public use.
1
Entering edit mode
22 months ago
gbl1 • 70

Hello,

I have a lot of reads from an individual, and let's say I am very interested in a short sequence such as "ATCGATCG" I would like to design a primer from my reads that ends by it. NNNNNNNNNNNNATCGATCG and NNNNNNNNNNNNCGATCGAT.

How to find out what is the most present sequence in my read to design such a primer?

genome assembly • 421 views
0
Entering edit mode

As you tagged, you should assemble the genome or target sequence region firstly, before designing primers.

Actually, I have no idea about what you want.

0
Entering edit mode

Why not just order a library of pseudo-degenerate primers from your supplier? You can order primers which have the sequence nnnnnnnnATCGATCG, which will amplify throughout the genome, if that's what you want.

It's not very clear at the moment what the purpose is, but doing this from reads seems likely to lead to an enormous number of primer sequences - I agree with assembly.

0
Entering edit mode

Hi jrj.healey and shenwei356

If I had enough reads to assemble the genome, I would not be in need of having more reads or design any primer for it… If the genome was assemblable, I'll do a simple in silico study.

As I tried to patent warm water, and was told someone already published it, let me explain it better:

8 bases oligo should be found in many place in the reads, from all the occurences, there's most likely one in which one of the base before is proheminent

of those 9 bases oligo, , from all the occurences, there's most likely one in which one of the base is proheminent …

bla bla bla until 20 base

degenarate is not a good option, for there's too many solution at the end and I would get too many bullshit if I get a primer with 16 degenerated bases … and to do that, I would not need any reads in first instance…

0
Entering edit mode

It sounds like you are trying to close gaps in your genome in that case?

For what its worth, if your data is so poor that assembly doesn't yield any improvement, you have bigger problems and should start over.

This seems a lot like an XY problem to me, so perhaps explain what you're actually trying to investigate?

0
Entering edit mode

Idea might be to take a sequence signal that is specific of coding region, then having some primer that might would fit them with S3AP… for basically, I'm working with plants, octoploid species, a normal illumina read of the genome cover just one third, half of the genome is gypsy and 10% copia… so having an idea of what some coding region could be might help in making it more assemblable. please consider it is an orphan crop, and that I would never have enough budget to get enough random reads to make an assembly, but some targeted reads could help in completed it.

0
Entering edit mode

You should still attempt to assemble your reads, as you do not know where the gaps are going to be. You may end up designing primers unnecessarily.

I think you're underestimating how many primers this approach could yield.

Am I to understand that you are trying to identify a specific gene (or partial gene) within your reads?

0
Entering edit mode

isn't this in principle looking for "ATCGATCG" motif as all the previous bases can be any base? If so, you would have lots of amplification. @ gbl1

0
Entering edit mode

So, the "ATCGATCG" might appear either on a random way, or in a more or less conserved sequence. I try to target the "conserved" sequence, and then need to figure out what it is…

0
Entering edit mode

I really don’t think your approach is going to work the way you think it will, nor will it solve the problem as you expect...

0
Entering edit mode

You have no idea what "the problem" is… Neither if I actually have a problem…

2
Entering edit mode

I understand this might be a language barrier, but this thread is dangerously close to becoming personal. Let's keep the discussion to just the science please.

0
Entering edit mode

Sorry for sounding rude, but I sound I need to explain a point again, even if I have already told it in this tread:

• I don't try to assemble the genome
• I don't have enough read to assemble the genome
• I cannot afford enough read to assemble the genome
• I don't need to assemble the genome for my experimental purpose

basically, there's 2 possible ways to get a satisfying result:

For all ATCGATCG sequence, the previous base is more often, let's says G
For all GATCGATCG sequence, the previous base is more often, bla bla bla…

OR

in For ATCGATCG sequences the proportion of each bases at each position from it are …

Even if, when I was a kid, when I asked to my dad why such a thing does not exist, he answered "because Einstein left you things to discover" I sure someone tried such a thing before.

0
Entering edit mode

Do you have reads in fastq, fasta or bam format? Do you only want one primer, or a list of sequences and how often they occur? How long should this primer be?

0
Entering edit mode

I think the original format is fastq, but I also have them in fasta, and even in a blast DB

Primers should be around 22 nucleotides

I would like one primer each side (it may have some degenerations, but not 16 N ;)

The "CGATCGAT" exemple was a random one, there's several short sequences I might look for!

0
Entering edit mode

Hi,

My message box gave me an answer:

18 hours, 15 minutes ago: genomax wrote on C: Designing a primer from reads : Only way you are going to be able to do this is to align all reads just at ATCGATCG (ignoring bases around this sequence in alignment). Then parse the columnar positions in that alignment to get the information you want.

That indeed look like a good way of doing it… I can't find it in the tread, and don't know how to PM someone here… So can anyone ask genomax what is the best way of doing that? Advising me a soft that would do that?

0
Entering edit mode

0
Entering edit mode

Sorry, I though I was commenting…

0
Entering edit mode

You can join the biostars slack to pm users

0
Entering edit mode

gbl1 : I had added that comment yesterday but deleted it later since I was only outlining logic one could use to do what you wanted. I don't recollect any software off the top of my head that would allow you to align a specific chunk of sequence in all reads ignoring sequence around that core but it certainly may exist out there.

This can also be done by writing a custom program which can create a dictionary of bases in positions around that core sequence which you can then use (only in cases where ATCGATCG is more or less in the middle edit: I see now that in original question this sequnce needs to be at the end) to get the info you want.

0
Entering edit mode

At least you understood the logic I was seeking for ;) For sure it might exist… I can't code that myself… I'm not that good in programming, I'm mostly a molecular biologist…

0
Entering edit mode

How many reads do you have? You could take a sample of a couple hundred and try doing this in MEGA. You may have to play with gap opening penalties (and other settings) to see if you can force the program from opening too many gaps when doing the alignment. You may be able to highlight the sequence you want and say don't put any gaps here.

Note: Use your fasta files with MEGA.

0
Entering edit mode

I have got from memory 30Gb of reads, this cover 1/3 of unique region

I'll look at Mega ;) thanks for the hint

0
Entering edit mode

I think you should look at @Wouter's answer below and not worry about MEGA (with 30G of reads). We are not sure what it is you are trying to achieve but after trying the following you can tell us if the solution is going in the right direction.

2
Entering edit mode
17 months ago
Belgium

I believe the code below currently answers the question. I am not sure if it is technically or biologically sane. I don't know how you got these patterns and how exactly making primers from these is going to help you solve your assembly, but that's not my job ;-)

What happens:

1. Reads are searched in a fastq.gz [--fastq] file with an exact match to (a) given pattern(s) [--pattern]
2. If pattern is found a sequence of length 22 is extracted (adjustable parameter in the script)
3. All those sequences which are found are counted and sorted, reporting the 20 most frequent strings (adjustable parameter in the script). Results are printed to stdout.

What does not happen:

• Looking at reverse complement or strand
• Allowing nucleotide mismatches
• Incorporating degenerate nucleotides

This script was only limited tested on a small dataset. I don't know about the size of your dataset and the RAM you have available. You will need python and biopython. Save the code as pattern2primer.py or something like that and execute as e.g. python pattern2primer.py --fastq myreads.fastq.gz --pattern ATCGAGAAG

Let me know if you have additional specifications.

0
Entering edit mode

looks lovely! I'm going to try this as soon as possible!!!

0
Entering edit mode

Hi,

So, I tried on my first set of reads, it has worked perfectly... I will need to sort that out, but it is great!

I tried on my second set, I had a bug:

Traceback (most recent call last):
File "pattern2primer.py", line 33, in <module>
main()
File "pattern2primer.py", line 11, in main
for read in SeqIO.parse(gzip.open(args.fastq, 'rt'), "fastq"):
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 582, in parse
for r in i:
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 1033, in FastqPhredIterator
for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/QualityIO.py", line 912, in FastqGeneralIterator
File "/usr/lib/python2.7/gzip.py", line 455, in readline
File "/usr/lib/python2.7/gzip.py", line 261, in read
File "/usr/lib/python2.7/gzip.py", line 308, in _read
File "/usr/lib/python2.7/gzip.py", line 347, in _read_eof
hex(self.crc)))
IOError: CRC check failed 0x485243c5 != 0x75682b8aL
beleduc@dx3-biotek67:~\$