How to generate a new FASTA from an assembly-assembly mapping ?
1
0
Entering edit mode
6.2 years ago

Hi everyone,

As you may know, the NCBI offer a "assembly-assembly remapping service" which basically map contigs of one genome (let's call it Genome 1) to the assembly of another genome (lets call it GenomeRef) My purpose is to generate a new file (FASTA) which consists on all the sequences of my mapped contigs (in the right order depending on the mapping) and replace the GenomeRef positions which have not been mapped by N .

I downloaded the GFF file provided on the NCBI website :

Example of a line :

Contig1 RefSeq match 49000 49020 . + . ID=xxxxxxxxxxxx;Target=GenomeRefContig5 6467734 6467754 +;best_on_query=1;best_on_query_same_unit=1;best_on_subject=1;best_on_subject_same_unit=1;gap_count=0;genomic_to_genomic=1;num_ident=21;num_mismatch=0;pct_coverage=0.000282069;pct_coverage_hiqual=0.000282069;pct_ident_quantized=98;pct_identity_gap=100;pct_identity_gapopen_only=100;pct_identity_ungap=100;reciprocity=3;same_unit_reciprocity=3

The problems i am facing while doing it are :

  • The GFF file is not sorted depending on the GenomeRef sequences but on the Genome 1 contigs

  • I think i would be able to use samtools faidx and a for loop to retrieve all my contigs sequences from the Genome 1 fasta file but how to replace gaps between those contigs by N depending on the gap length of the GenomeRef ?

[There will be a lot of gaps since the Genome 1 is a very poorly sequenced genome while the GenomeRef is a greatly sequenced genome (PacBio)]

I hope it is clear,

Thanks a lot !

assembly sequencing mapping samtools GFF • 1.9k views
ADD COMMENT
0
Entering edit mode

I have already been able to do ->

Remove the extra informations i don't want :

cut -f1,4,5,9 -d$'\t'  file.gff | sed 's/ID.*Target=//g' | sed 's/-.*\|+.*//g'  > simplified.gff

Sort the simplified file depending on the Genome ref scaffold and remove headers :

sort -k4 simplified_gff > final.gff

sed 's/#!gff//g' final_gff.gff | sed 's/##gff//g' | sed 's/#!processor NCBI annotwriter//g' >  MyGFF.gff

sed -i '/^$/d'  MyGFF.gff

Then i seperated every informations in separated files :

while read line ; do echo "$line" | cut -f1 ; done < MyGFF.gff >> line1                       #Name of Genome 1 contig

while read line ; do echo "$line" | cut -f2 ; done < MyGFF.gff >> line2                       #Start pos of genome 1 contig

while read line ; do echo "$line" | cut -f3 ; done < MyGFF.gff >> line3                       #End pos of genome 1 contig

while read line ; do echo "$line" | cut -f4 ; done < MyGFF.gff >> ligneINTER

while read line ; do echo "$line" | cut -f1 -d " " ; done < ligneINTER >> line4               #Name of GenomeRef scaffold

while read line ; do echo "$line" | cut -f2 -d " " ; done < ligneINTER >> line5               #Start pos of GenomeRef scaffold

while read line ; do echo "$line" | cut -f3 -d " " ; done < ligneINTER >> line6               #End pos of GenomeRef scaffold

Then for each line, i can make a samtools to catch the sequence of my Genome1 and add it to a file that have the GenomeRef scaffold as name :

####wc -l file.gff = X
####END = X

for i in $(seq 1 $END) ; do samtools faidx Genome1.fasta `sed "${i}q;d" line1`:`sed "${i}q;d" line2`-`sed "${i}q;d" line3` >> `sed "${i}q;d" line4`.fasta ;done

But i really don't see how i can add the N's ....

ADD REPLY
1
Entering edit mode
6.2 years ago
tdmurphy ▴ 190

On the NCBI Remap FTP site, the alignment files for each assembly pair are provided twice (assm1/assm2 and assm2/assm1). The alignments themselves are flip-flopped for query and target, which would take care of most of your sorting problem.

ADD COMMENT

Login before adding your answer.

Traffic: 4004 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6