Hi everyone!
The dataset that I am working with comes from a producer cell line that expresses a recombinant protein. Therefore, I was wondering how I could modify the reference .fasta and .gtf files in order to include the sequence of the recombinant molecule. Are there any convenient functions/packages for that job in either python/R/terminal? Is there anything that I need to be careful about? A potential candidate that I have found is reform1; does anyone have experience with this package?
Where does OP say that? I don't know industrial terminology but is that implied by the following sentence?
Not at all, that's why I made it a precondition. There's at least one method that allows for that, and the company selling it became pretty popular in the industry.
In case OP doesn't know precisely where the production vector inserted, I don't know how useful the efforts are, but that's up to OP to decide. AFAIK companies prepare for future questions by regulatory authorities with regard to proof that cell line stability was unaffected, so they would need to know whether a gene was disrupted and, if so, which gene and where.
However, I guess I extrapolated too much from the familiar terms, which don't pop up so often here: all of my answer applies to DNAseq-based methods. I don't know how OP determined the insertion site based on RNAseq.
Thanks for your detailed reply. As you guessed, I have not confirmed the insertion site of the sequence. However, from this conversation (discussion 1) I understand that it should be OK to add the sequence as an extra chromosome. What do you think? I have to admit that I am having major problems with editing the .gtf file to add the extra line as suggested in the link that I have attached: I cannot find an editor that will open the file. I will also definitely check the Staden package.
If you add it as an additional chromosome, you don't need to modify the existing sequence, so I don't see a need for Staden or reform (btw, the link is barely visible in your original post).
You can append your sequence to the fasta sequence you use and the annotation lines to the gtf/gff. Just make sure the sequence identifiers match. You don't need to recalculate any coordinates, as you keep it separate.
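For what it's worth, here is a minimal Python sketch of the append approach described above. It assumes uncompressed files and a single-record FASTA for the insert, and it annotates the whole insert as one gene/transcript/exon on the "+" strand. The function names, the "custom" source label and the "_t1" transcript suffix are made up for illustration, so adjust them to your construct.

```python
from pathlib import Path


def read_single_fasta(text):
    """Return (sequence_id, sequence) for a FASTA string holding one record."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("expected exactly one FASTA record")
    # The identifier is the header text up to the first whitespace;
    # this is what must match column 1 of the GTF lines.
    seq_id = lines[0][1:].split()[0]
    return seq_id, "".join(lines[1:])


def make_gtf_lines(seq_id, length, gene_id, source="custom"):
    """Build gene/transcript/exon GTF lines spanning the whole added sequence.

    Assumption: the insert is treated as a single-exon gene on the '+' strand.
    """
    gene_attrs = f'gene_id "{gene_id}"; gene_name "{gene_id}";'
    tx_attrs = (
        f'gene_id "{gene_id}"; transcript_id "{gene_id}_t1"; '
        f'gene_name "{gene_id}";'
    )
    attrs = {"gene": gene_attrs, "transcript": tx_attrs, "exon": tx_attrs}
    return [
        "\t".join(
            [seq_id, source, feature, "1", str(length), ".", "+", ".", attrs[feature]]
        )
        for feature in ("gene", "transcript", "exon")
    ]


def append_to_file(path, text):
    """Append text to a plain-text file, adding a newline first if the
    file does not already end with one (does not handle gzipped files)."""
    path = Path(path)
    needs_newline = False
    if path.exists() and path.stat().st_size > 0:
        with path.open("rb") as fh:
            fh.seek(-1, 2)  # jump to the last byte of the file
            needs_newline = fh.read(1) != b"\n"
    with path.open("a") as fh:
        if needs_newline:
            fh.write("\n")
        fh.write(text if text.endswith("\n") else text + "\n")
```

Usage would be something like: read the insert with `read_single_fasta(Path("insert.fa").read_text())`, append that same text to a copy of the genome FASTA with `append_to_file`, then append `"\n".join(make_gtf_lines(seq_id, len(seq), "insert"))` to a copy of the GTF. The file names here are placeholders. Working on copies keeps the original reference intact, and the genome index has to be rebuilt afterwards.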
Thanks, it's good to know that it can be done this way. I am currently getting some errors in the featureCounts step, so I might be back with more questions...
A GTF file is plain text, so as long as you use an editor like Notepad++ (or similar) you should be able to edit it. Be sure to keep the proper format (columns, tab separators, etc.) when you add the extra entry for "insert".
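Since broken formatting in a hand-edited line is easy to miss, a quick sanity check along these lines may help before running the counting step. This is only a sketch of the basic GTF layout (nine tab-separated columns, 1-based coordinates, a gene_id attribute), not a full validator, and `check_gtf_line` is a name made up for this example.

```python
def check_gtf_line(line):
    """Return a list of format problems found in one GTF feature line
    (an empty list means none of these basic checks failed)."""
    problems = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # A common cause: the editor inserted spaces instead of tab characters.
        return [f"expected 9 tab-separated columns, found {len(fields)}"]
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    if not (start.isdecimal() and end.isdecimal()):
        problems.append("start and end must be integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("coordinates are 1-based and start must not exceed end")
    if strand not in {"+", "-", "."}:
        problems.append(f"unexpected strand value {strand!r}")
    if 'gene_id "' not in attributes:
        problems.append('attributes column lacks a gene_id "..." entry')
    return problems


# Example: a well-formed exon line for a sequence named "insert".
good = 'insert\tcustom\texon\t1\t100\t.\t+\t.\tgene_id "insert"; transcript_id "insert_t1";'
print(check_gtf_line(good))                      # no problems reported
print(check_gtf_line(good.replace("\t", " ")))   # tabs replaced by spaces
```

The gene_id check is there because featureCounts by default counts exon features and groups them by the gene_id attribute, so a line without it is a likely source of errors.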
Yes, indeed, it was easier than I thought when I used vim to append the new lines.