7.3 years ago
saranpons3
Hello All,
In the complete genome file of the fruit fly (GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz), I see that in many places A, T, C, G are in lower case. The reason given in the README.txt is that repetitive sequences in eukaryotes are masked to lower-case. If I want to generate random/simulated reads from this complete genome, should I convert the lowercase a,t,c,g to uppercase A,T,C,G before simulation? Thanks in advance.
Lower case letters can indicate low-confidence calls or masked bases. Sounds like in this case they are masked due to being repetitive. You will most closely approximate real data by converting them to upper case before generating simulated reads.
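If it helps, here is a minimal Python sketch of that conversion. It assumes a plain or gzipped FASTA file in which header lines start with ">"; the function names `uppercase_fasta` and `convert_file` are made up for illustration and are not part of any existing tool.

```python
import gzip


def uppercase_fasta(lines):
    """Yield FASTA lines with sequence lines upper-cased.

    Header lines (starting with '>') are passed through unchanged so that
    sequence names and descriptions keep their original case.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.upper()


def convert_file(in_path, out_path):
    """Write an upper-cased (unmasked) copy of a FASTA file.

    Assumption: a path ending in '.gz' is gzip-compressed; the output is
    always written as plain text.
    """
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as src, open(out_path, "wt") as dst:
        dst.writelines(uppercase_fasta(src))
```

For example, `convert_file("GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz", "genome.upper.fna")` would write an unmasked copy (the output name is arbitrary). Many read simulators may have their own case handling, so it is worth checking whether this step is needed for the one you use.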
Hello Brian, Thanks for the answer.
"You will most closely approximate real data by converting them to upper case before generating simulated reads". Does this statement mean that I can convert lowercase a,t,c,g to uppercase A,T,C,G before generating simulated reads?
Can you tell me, in practice, what would be the sequencing depth for sequencing the human genome?
Thanks in advance.
Yes.
It can be any depth, but often 30x is targeted for variant-calling.
What about for de novo assembly?
Again, it varies. I recommend at a minimum 30x per ploidy, and preferably at least 100x per ploidy, so 200x for a diploid, for real unamplified 2x150bp Illumina reads. With simulated data you don't need as much because there's no bias, so the coverage is uniform. Also, the longer the reads are, the less coverage you generally need. Typically when people are assembling a complicated genome like human purely from Illumina reads, they use multiple libraries with different insert sizes - mostly short insert (potentially overlapping reads) with a smaller amount of coverage (say, 5x) of long-mate-pair libraries for scaffolding. These days there are other options like 10X (meaning 10X Genomics the company, not "10-fold coverage") linked reads and PacBio very long reads, which can have different coverage requirements and assembly approaches. You can, incidentally, simulate PacBio data with the BBMap package, but I don't know of any existing 10X simulators.
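To turn those coverage targets into a number of reads to simulate, the arithmetic is roughly coverage x genome size / bases per fragment. A rough sketch, assuming every read is exactly the stated length and every base counts toward coverage; the function names and the 100 Mbp genome size below are hypothetical, chosen only to make the numbers easy to follow:

```python
def total_coverage(per_ploidy_coverage, ploidy):
    """Total coverage target, e.g. 100x per ploidy for a diploid gives 200x."""
    return per_ploidy_coverage * ploidy


def fragments_needed(genome_size_bp, coverage, read_length_bp=150, paired=True):
    """Number of read pairs (or single reads) needed to reach a coverage target.

    genome_size_bp is the haploid assembly size in base pairs.
    A paired fragment contributes two reads of read_length_bp bases each.
    """
    bases_needed = genome_size_bp * coverage
    bases_per_fragment = read_length_bp * (2 if paired else 1)
    # Integer ceiling division, so the target is met rather than just missed.
    return -(-bases_needed // bases_per_fragment)


# Hypothetical 100 Mbp genome with 2x150bp reads:
print(fragments_needed(100_000_000, 30))                      # 10000000 pairs
print(fragments_needed(100_000_000, total_coverage(100, 2)))  # 66666667 pairs
```

Real libraries lose some of this to duplicates, low-quality reads and uneven coverage, so treat the result as a lower bound for real data.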
thanks for the reply