Assume a contig of draft genome assembly that contains several regions of unknown sequence and length, which are indicated by stretches of 'N'. Further assume the presence of a contig of the same genome region, whose source is a complete and trustworthy genome assembly of the same organism. Finally, assume that the contig of the draft genome and the contig of the final genome have been properly aligned (i.e., positional homology is established).
The result would be something like this:
$ cat alignm.fas >contig_draft ACGTNNNNNNNNNNACGTNNNNNNNNNNACGT >contig_final ACGTACGT------ACGT------ACGTACGT
I would like to remove those 'N' from the draft contig that are matched by a gap character in the alignment (n=12 in the above example), but retain those that are matched by a nucleotide (n=8 in the above example).
In the above example, the desired output would be a new draft contig sequence with a length of 20 nucleotides, 8 of which are 'N'.
$ cat alignm.fas | commands_sought >contig_draft ACGTNNNNACGTNNNNACGT
Is there any code (e.g., on GitHub) already available to conduct the above-mentioned procedure?