Question

Help - how to merge human reference genome and GTF file with a custom sequence.

0

Entering edit mode

4 months ago

Dyno • 0

Hello Biostars,

I am looking for some guidance on how to merge some files for my rna-bulk sequencing analysis. Let me start by describing the problem:

I received an mRNA sequence of 4775 characters which I would like to merge with the human reference genome that I download from NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/). I downloaded all of the files, but for this I am sure that we only consider Genome sequences (FASTA) and Annotation features (GTF) because these are the only files I use for my mapping (HISAT2) and quantification (featureCounts/subread). Unzipping the downloaded files I get GCF_00001405.40_GRCh38.p14_genomic.fna and genomic.gtf.

To merge the reference genome and my custom mRNA sequence (custom_file1.fna) I used the cat function, but before doing, I made sure that the custom mRNA sequence had the correct structure (info line + 80 characters per line)

please feel free to correct me if any of the approaches are incorrect.

fold -w 80 custom_file1.fna > custom_file1_formatted.fna

Output example:

>MAGRETH
CTGTGGGGCCTGGGCGAGCCTCCAGAGCACACCCTGCGGTACCTGGTGCTGCACCTGGCATCTCTTCAGCTGGGCCTGTT
GCTTAATGGCGTGTGCAGCCTGGCCGAGGAACTGAGACACATCCACTCTAGATACCGGGGCAGCTACTGGCGGACAGTTGA
GGAACTGAGTGCTGCACCTGGCATCTCTTCAGCTGGGCCTGTTGCTTAATGGCGTGTGCAGCCTGGCCGAGGAACTGAGAC
ACATCCACTCTAGATACCGGGGCAGCTACTGGCGGACAGTT

Merge files:

cat custom_file1_formatted.fna GCF_00001405.40_GRCh38.p14_genomic.fna > merged_ref_genome.fna

I hope that my approach is correct and can be used by others in the future, but how do I proceed with the gtf file? Looking at the genomic.gtf file I see the expected structure, but ofcourse this looks nothing like the custom_file1_formatted.fna sequence that I have:

#gtf-version 2.2
#!genome-build GRCh38.p14
#!genome-build-accession NCBI_Assembly:GCF_000001405.40
#!annotation-date 10/02/2023
#!annotation-source NCBI RefSeq GCF_000001405.40-RS_2023_10
NC_000001.11    BestRefSeq      gene    11874   14409   .       +       .       gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11
 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";

How do I proceed? How do I "add" the content of my custom sequence to the gtf file?

I appreciate any comment, thank you in advance :)

reference-genome genome gtf • 759 views

ADD COMMENT • link updated 4 months ago by Istvan Albert 100k • written 4 months ago by Dyno • 0

score 1 · Answer 1 · 2023-12-12

1

Entering edit mode

4 months ago

Istvan Albert 100k

It would be better if you also explained why you need to do this and what that sequence is supposed to represent. Then, we could advise on whether the approach is sound.

The word "merging" is not quite right; you have "concatenated/added" to the genome, not merged into it.

If your sequence or parts of it are already present in the human genome, the net effect of this concatenation is not so easy to predict. Thus the concatenation only makes sense if your sequence is unique and not already present in the genome.

As for an answer to your question, since you have concatenated the FASTA all you need to do is concatenate one or more lines to the GFF file.

The number of lines you add depends on how many features your new sequence hosts, if all you need is to mark that one sequence you've added, it would look like

MAGRETH  .  exon  1 4775  .  .  .   Parent=g1;gene_id=g1;transcript_id=t1

You have to fill in a few fields that match those that will be counting over. Here I filled in the default fields featureCounts will expect to see.

Read up on the GFF format for the number of columns. You can put a dot where you have no info.

ADD COMMENT • link 4 months ago by Istvan Albert 100k

0

Entering edit mode

Hi Istvan,

Thank you so much for your contribution. You're right that I should have addressed it as 'concatenated', this is noted.

First of, thanks for making me aware whether the sequence is a part or the full length of a gene, I was not given this information by the wet-lab researcher. (I have sent a request for this information to the researcher).

I will try to give a deeper understanding. I am doing the bioinformatic part of this question of research. I was not involved in the experimental design or the lab part, so if you have any questions, please feel free to ask and I will do my best to answer, otherwise I will probably forward the question, as I think the answer will be of interest to me as well.

Initially I were given the sequencing data of the study which consists of 2 groups: A group with the synthetic gene and a group without. I completed the analysis and the researchers were happy with the results. After some disussion, they wanted to confirm that there is some reads of the synthetic gene, therefore I was asked to do what I am asking help for in this thread. As mentioned, I currently do not know which part of the synthetic sequence is the gene, but I will hopefully know soon so that we can concatenate it with the gtf file.

edit: all the bases are included in the gene. So as you mentioned in your response, we have to set from 1 to 4774. I will try to work on this, i will return with either a result or a follow-up question :)

ADD REPLY • link 4 months ago by Dyno • 0

1

Entering edit mode

If all you need is to confirm a synthetic gene then you could just align to the synthetic gene alone (if the synthetic region is unique).

The reads that span the original sequence and the synthetic region boundaries would tell you a lot of wether that region exists as a whole. In this case, a visual inspection in IGV carries all the evidence you need.

Tools like featureCounts are not well suited for this purpose as they don'w have the ability to separate between identical regions.

A better tool that could do a job is Salmon, that has an expectation maximization algorithm. Add your sequence to the transcriptome, compute the index on this concatenated transcript then let the EM algorithm do its job. It should assign the reads to both original location and synthetic origin as appropriate with the proper weights.

ADD REPLY • link 4 months ago by Istvan Albert 100k

0

Entering edit mode

Thank you for this suggestion, it was much more straight forward to look at it using the Integrative Genomics Viewer, and I would say that it gives a deeper view with more clarifications as well.

I attempted to use the Salmon tool, but got stuck with the indexing as I repeatly got this warning:

[warning] Entry with header [NC_000001.11] was longer than 400000 nucleotides.  This is probably a chromosome instead of a transcript.

I get the point of the warning, but I did not try much to fix it as I will leave the salmon for another day.

Thanks again for assisting, have a nice one :)

ADD REPLY • link 4 months ago by Dyno • 0

0

Entering edit mode

When using Salmon you need to add your sequence to the fasta file that contains individual transcripts and not the genome. Happy to hear it all worked out in the end.

ADD REPLY • link 4 months ago by Istvan Albert 100k