Biostar Beta. Not for public use.
Question: random sampling genome.fa
0
Entering edit mode

I have a genome of about 2 gb composed by scaffolds I would random sample the genome.

I used reformat.sh but the output was only a scaffold. I need 1/3 of the total genome... I report only some scaffolds as example:

>LGKD01000001.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold4_contig_1, whole genome shotgun sequence
GAACAGCATGAATGTTAAAACtgaaatggatgatgatgatgatgatgatgatgatgatggcagcaacAGCCatgattatatttaatatgttgttagttataatcataataatgatgataatgttgataacaaTAATGGTTGCAATAATG
>KQ415657.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 unplaced genomic scaffold Scaffold5, whole genome shotgun sequence
tatatatatatagtcaattcgagGATGTTAGATCGACAATGGGGATTATAGAATCCCACAAAAAATTCCACTGGT
>LGKD01000032.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold12_contig_1, whole genome shotgun sequence
GAAGTGGTAAAGAGTgcgatgcgctgaaaaaagagagaacagtacttgaaatGTGGTTTCATTCTagtagtaaat
>LGKD01000033.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold16_contig_1, whole genome shotgun sequence
ctgaTCAACAGAatagggccaatcattcttcatgacaatgctcgaccacacgttttaCTAATGA
>LGKD01000034.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold22_contig_1, whole genome shotgun sequence
TTATCTATATACGagaatattatctatatataaaggaataccaaaaaaacaagaacaacgggtcattcggaattttcttt

There is a script able to do this?

ADD COMMENTlink 10 months ago frida • 10 • updated 10 months ago manuel.belmadani • 830
Entering edit mode
0

You would need to collapse the entire genome sequence into one long fasta and then sample if you truly need 1/3 of total.

Edit: Can you provide exact command line you used?

ADD REPLYlink 10 months ago
genomax
68k
Entering edit mode
0
reformat.sh in=my file.fa ou=newfile.fa samplerate=0.3
ADD REPLYlink 10 months ago
frida
• 10
• updated 10 months ago
h.mon
25k
0
Entering edit mode

Something like this would do:

$ NSAMPLE=2 # Number of sequences you want
$ cat genome.fa  | tr '\n' '\t' | tr '>' '\n' | grep -v "^$" | sort -R | head -n$NSAMPLE | sed -e 's|^|>|g'  | tr '\t' '\n' | grep -v "^$" 
>LGKD01000001.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold4_contig_1, whole genome shotgun sequence
GAACAGCATGAATGTTAAAACtgaaatggatgatgatgatgatgatgatgatgatgatggcagcaacAGCCatgattatatttaatatgttgttagttataatcataataatgatgataatgttgataacaaTAATGGTTGCAATAATG
>LGKD01000034.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold22_contig_1, whole genome shotgun sequence
TTATCTATATACGagaatattatctatatataaaggaataccaaaaaaacaagaacaacgggtcattcggaattttcttt

The idea is to convert every sequence as one line including the header and the sequence itself, separated by tabs. sort -R gives you a random order, and head -n returns a certain number of sequences (i.e. samples since your input is shuffled) and the rest just converts it back to the format it had before (convert tabs back to new lines.) You can figure out the NSAMPLES to get 1/3 of the total by doing grep -c ">" genome.fa and dividing that number by 3.

ADD COMMENTlink 10 months ago manuel.belmadani • 830
Entering edit mode
0

I have a feeling OP is not looking to get a third of fasta sequence entries but rather a third of the actual sequence. It is kind of an odd request if that is true. Perhaps OP will clarify.

ADD REPLYlink 10 months ago
genomax
68k
Entering edit mode
0

Oh interesting. Possible solution for something like that then:

# Randomly get 1/3 of each non-header sequence
awk ' BEGIN{ srand(); }; /^[^>]/ {third=(length($1)/3); offset=(rand() * (third * 2) ); $1=substr($1,offset, offset + third) } 1' genome.fa
ADD REPLYlink 10 months ago
manuel.belmadani
• 830
Entering edit mode
0

Yes, I need only a part of the fasta file. Infact I'm using RepeatScout to find ripetitive elemnts, but the genome is 2 gb and it doesn't work with such a big file. So, the idea is to obtain only a part of the genome because ripetitive sequences are interspersed in the genome

ADD REPLYlink 10 months ago
frida
• 10
Entering edit mode
0

Yes, I need only a part of the fasta file.

I don't think that is right. You seem to have a multi-fasta file so you need as many entries as needed to make roughly third of total (e.g. if you had 15 fasta entries then you would make 5 x 3 files). That is going to be complicated by fact that the sequences are not of equal length, correct?

Why not approximately cut the file into three (or how many ever) pieces?

ADD REPLYlink 10 months ago
genomax
68k
Entering edit mode
0

Hi genomax, I'm the user fripeki but I could not post anymore with that account (reaching max limit of 5 post for day). I Appreciated your idea. Can you suggest me a command to cut the file into 3 approximately equal parts? Sorry I'm learning bioinformatic so I don't know all the solutions

ADD REPLYlink 10 months ago
SeaStar
• 10
Entering edit mode
1

faSplit utility from Jim Kent will also work (linux version linked, do chmod a+x faSplit after downloading). This will be a clean solution compared to split but it may not make the files exactly equal in size (depending on what you have in your files). Change number 3 to desired pieces.

$ ./faSplit sequence genome.fa 3 splt_

this will generate files

splt_0.fa
splt_1.fa
splt_2.fa
ADD REPLYlink 10 months ago
genomax
68k
Entering edit mode
0

ok! thank you very much for your help!

ADD REPLYlink 10 months ago
frida
• 10
Entering edit mode
0

You can use the unix command split to split the files. Adjust the number 3 to suit your purpose.

# Following will split the file into three equal pieces. Split files will get names with prefix (spl_)

$ split -n 3 genome.fa spl_

# this will result in three equal size pieces

spl_aa 
spl_ab
spl_ac

This will most likely split the file in a way that the second and third files will not start with a valid fasta header. You can look at the file 2 and 3 by

$ head -3 splt_ab

to confirm this.

You can insert a valid fasta header by doing the following for file 2 (and 3 etc)

$ sed -i '1 i\>begin_file2' spl_ab
$ sed -i '1 i\>begin_file3' spl_ac
ADD REPLYlink 10 months ago
genomax
68k

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0