Question

How to interpret the title of each sequence in a reference genome fasta file downloaded from Ensembl

12 months ago

hande • 0

Hi all,

I have a very simple question but I could not find the answer anywhere else.

Let's assume I have the human reference genome downloaded from Ensembl. The first line looks like this:

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF

I understand that the lines after this line will have the reference DNA sequence for chromosome 1, from start position 1 to the end position 248956422. So, the fields after "GRCh38:" are {chromosome}:{start position}:{end position}:1 REF. What does the extra "1" represent after the "{end position}:"?

I am asking because I would like to extract the sequence for a subset of positions on the chromosome, and I would like to make sure I have the title of that sequence correct in my modified fasta file. For example, if I want the sequence for only one gene, which is on chromosome 1 with a start position of 1500 and an end position of 2500, would the following be the correct title for the sequence?

>1 dna:chromosome chromosome:GRCh38:1:1500:2500:1 REF
{add here the sequence between the positions 1500 and 2500}

What does the extra :1 after the end position denote? Should I also modify it so that the alignment tools (e.g. STAR) interpret it correctly?

I would like to exclude all the remaining parts of the sequence because I know that no reads from those parts are present in my fastq files, and I don't want any read to be mapped to those remaining parts by mistake. That is why I'm trying to make it look like the whole sequence of chromosome 1 consists of only the sequence of that one gene.

It would be great if anyone could give feedback on this. Thanks a lot in advance!

Some more background if it helps: I would like to combine the reference sequence of one gene from one species with the whole genome sequence of another species, and I would like to keep the same formatting in the combined fasta file.
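
A minimal sketch of the extraction described above, assuming the chromosome sequence is already loaded as a Python string and that positions are 1-based and inclusive, as in the Ensembl header. The helper names `extract_region` and `format_record` are hypothetical, the header layout simply mirrors the example line in this post, and the toy sequence is made up for illustration.

```python
def extract_region(sequence: str, start: int, end: int) -> str:
    """Return the 1-based, inclusive region start..end of a sequence."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"region {start}-{end} is outside 1-{len(sequence)}")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


def format_record(seq_id: str, assembly: str, start: int, end: int,
                  subseq: str, strand: int = 1, width: int = 60) -> str:
    """Build a FASTA record whose header mimics the Ensembl layout (assumed)."""
    header = (f">{seq_id} dna:chromosome "
              f"chromosome:{assembly}:{seq_id}:{start}:{end}:{strand} REF")
    # Wrap the sequence into fixed-width lines, as FASTA files usually do.
    lines = [subseq[i:i + width] for i in range(0, len(subseq), width)]
    return header + "\n" + "\n".join(lines) + "\n"


# Toy 10-base "chromosome" instead of the real chromosome 1.
toy_chromosome = "ACGTACGTAC"
region = extract_region(toy_chromosome, 3, 6)
record = format_record("1", "GRCh38", 3, 6, region)
print(record, end="")
```

Note that after such an extraction, alignment coordinates are reported relative to the shortened sequence, so position 1 of the new record corresponds to the original start position.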

reference fasta dna alignment genome sequence • 747 views

updated 12 months ago by Emily 23k • written 12 months ago by hande • 0

"I would like to exclude all the remaining parts of the sequence because I know that no reads from those parts are present in my fastq files and I don't want any read to be mapped to those remaining parts by mistake."

It's usually a bad idea. See "Exome Sequencing: Masking The Non-Genic Sequences?"

12 months ago by Pierre Lindenbaum 161k

score 0 · Answer 1 · 2023-05-02

12 months ago

Emily 23k

The final field after the end position is the strand: 1 means positive strand sequence, and -1 means negative strand sequence.
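
To make the field layout concrete, here is a small sketch that splits the header line from the question into named fields. It assumes the layout seen in that one example line (sequence name, sequence type, then a colon-separated location of coordinate system, assembly, name, start, end, strand). The `EnsemblHeader` type and `parse_ensembl_header` function are hypothetical names, not part of any Ensembl tool.

```python
from typing import NamedTuple


class EnsemblHeader(NamedTuple):
    seq_id: str        # first token, e.g. "1"
    seq_type: str      # e.g. "dna:chromosome"
    coord_system: str  # e.g. "chromosome"
    assembly: str      # e.g. "GRCh38"
    name: str          # sequence region name, e.g. "1"
    start: int
    end: int
    strand: int        # 1 = positive strand, -1 = negative strand


def parse_ensembl_header(line: str) -> EnsemblHeader:
    """Parse a header shaped like the example in the question (assumed layout)."""
    tokens = line.strip().lstrip(">").split()
    seq_id, seq_type, location = tokens[0], tokens[1], tokens[2]
    coord_system, assembly, name, start, end, strand = location.split(":")
    return EnsemblHeader(seq_id, seq_type, coord_system, assembly, name,
                         int(start), int(end), int(strand))


header = parse_ensembl_header(
    ">1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF")
print(header.start, header.end, header.strand)
```

For a genomic reference FASTA the strand field would normally stay 1, since the file stores the forward-strand sequence.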
