Question

A lot of unknown/weird gene names after alignment with cellranger or starsolo

3 months ago

Tolga ▴ 30

Dear community,

After running my single-cell pipeline (cellranger or starsolo), I get a large number of odd feature names that are also not documented in the Human Protein Atlas. They contain a dot and look like this:

 [865] "AC007750.1"  "AC016766.1"  "AC019197.1"  "AC009495.3"  "AC009495.1"  "AC009495.2"  "AC073050.1"  "AC016723.1" 
 [873] "AC007277.1"  "AC007405.1"  "AC007405.3"  "AC010092.1"  "AC104088.1"  "AC104088.3"  "AC104088.2"  "AC078883.2" 
 [881] "AC078883.1"  "AC016737.1"  "AC016737.2"  "AC010894.2"  "AC010894.3"  "AC093459.1"  "AC017048.3"  "AC092162.2" 
 [889] "AC073636.1"  "AC074286.1"  "AC079305.3"  "AC079305.1"  "AC019080.1"  "AC019080.5"  "AC019080.6"  "AC012499.1" 
 [897] "AC009948.3"  "AC009948.2"  "AC010680.3"  "AC010680.2"  "AC010680.4"  "AC010680.5"  "AC092640.1"  "AC009478.1" 
 [905] "AC068196.1"  "AC009962.1"  "AC064871.2"  "AC021851.1"  "AC021851.2"  "AC093639.3"  "AC096555.1"  "AC096667.1" 
 [913] "AC009315.1"  "AC097500.1"  "AC017071.1"  "AC017101.1"  "AC007319.1"  "AC092598.1"  "AC133106.1"  "AC013468.1" 
 [921] "AC093388.1"  "AC108047.1"  "AC006460.2"  "AC006460.1"  "AC005540.1"  "AC067945.3"  "AC067945.4"  "AC092614.1" 
 [929] "AC098617.1"  "AC096647.1"  "AC010983.1"  "AC104823.1"  "AC064834.1"  "AC114760.2"  "AC013264.1"  "AC011997.1" 
 [937] "AC011997.2"  "AC019330.1"  "AC020718.1"  "AC097717.1"  "AC012459.1"  "AC007163.1"  "AC005037.1"  "AC007272.1" 
 [945] "AC007279.2"  "AC069148.1"  "AC079354.3"  "AC064836.4"  "AC064836.2"  "AC064836.3"  "AC007736.1"  "AC007383.2" 
 [953] "AC007383.3"  "AC008269.1"  "AC007879.3"  "AC009226.1"  "AC096772.1"  "AC007038.2"  "AC007038.1"  "AC006994.2"  ...

In total I have 26,000 features in my Seurat object after alignment and QC. Does anyone have an idea what these genes are and how to deal with them? When doing an enrichment analysis, for instance, I can't map these gene names to Ensembl IDs.

Best,
Tolga

Cellranger • 617 views

Where did you get your reference from? 10x provides pre-made indexes for the human genome.

The IDs you have are for old BAC clones that were used in the initial stages of human genome sequencing, e.g. https://www.ncbi.nlm.nih.gov/nuccore/AC064836
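
To see how many features carry such clone-accession-style names, one option is to match them against the accession pattern. The sketch below is a heuristic, not an official rule: the regular expression (two uppercase letters, six digits, a dot, a numeric suffix) and the helper names `is_clone_based_name` and `split_features` are assumptions made for illustration, and placeholder names in other styles will not match.

```python
import re

# Heuristic pattern for clone-accession-style names such as "AC064836.4".
# Assumption for illustration: two letters, six digits, a dot, a numeric suffix.
CLONE_NAME = re.compile(r"[A-Z]{2}\d{6}\.\d+")


def is_clone_based_name(name: str) -> bool:
    """Return True if the whole name matches the accession-style pattern."""
    return CLONE_NAME.fullmatch(name) is not None


def split_features(names):
    """Split feature names into accession-style names and everything else."""
    clone_like = [n for n in names if is_clone_based_name(n)]
    other = [n for n in names if not is_clone_based_name(n)]
    return clone_like, other


# Small example list; in practice this would be the feature names of the object.
features = ["AC064836.4", "AC007750.1", "GAPDH", "MT-CO1", "AL627309.1"]
clone_like, other = split_features(features)
print(len(clone_like), "accession-style names;", len(other), "others")
```

Applied to the full feature list, this gives a rough count of how much of the 26,000 is made up of such placeholder names.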

ADD REPLY • link 3 months ago by GenoMax 142k

I got my reference from the 10x website, where I downloaded the 2020 human reference:

curl -O "https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz"

and I ran the following command:

/data/Tolga/cellranger/cellranger-7.2.0/cellranger count \
  --id "${srr}" \
  --transcriptome /data/Tolga/reference/refdata-gex-GRCh38-2020-A \
  --fastqs /data/Tolga/sra_files \
  --sample "${srr}" \
  --expect-cells "${expectedcells[$counter]}" \
  --localcores 30

So I provided the folder with the pre-made indexes and the annotation files.
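
Cell Ranger writes a `features.tsv.gz` next to the count matrix, with the Ensembl gene ID in the first column and the gene name in the second, so the IDs needed for enrichment can be read from the output even for names that a symbol-based lookup does not recognise. A minimal sketch of that lookup follows; `symbol_to_ensembl` and `load_features` are hypothetical helpers, the path in the comment assumes the usual `<id>/outs/` layout, and the IDs for the accession-style names in the example are placeholders, not real Ensembl IDs.

```python
import gzip


def symbol_to_ensembl(lines):
    """Build {gene name: [Ensembl IDs]} from features.tsv rows (ID, name, feature type)."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        ensembl_id, name = fields[0], fields[1]
        # Gene names are not guaranteed to be unique, so keep a list per name.
        mapping.setdefault(name, []).append(ensembl_id)
    return mapping


def load_features(path):
    """Read a gzipped features.tsv, e.g. <id>/outs/filtered_feature_bc_matrix/features.tsv.gz."""
    with gzip.open(path, "rt") as handle:
        return symbol_to_ensembl(handle)


# Illustrative rows in the features.tsv layout; the PLACEHOLDER IDs are not real.
example_rows = [
    "ENSG00000111640\tGAPDH\tGene Expression\n",
    "PLACEHOLDER_ID_1\tAC064836.4\tGene Expression\n",
    "PLACEHOLDER_ID_2\tAC007750.1\tGene Expression\n",
]
lookup = symbol_to_ensembl(example_rows)
print(lookup["AC064836.4"])
```

Names may have been altered when the matrix was loaded into an analysis object (for example to make duplicates unique), so a few names may need manual matching.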

But I don't understand why I have those features, especially the BAC clones.

Best regards

ADD REPLY • link updated 3 months ago by Ram 43k • written 3 months ago by Tolga ▴ 30

score 2 · Answer 1 · 2024-02-03

3 months ago

ATpoint 82k

Not sure what you want to hear. The majority of annotated genes have no functional term in databases such as KEGG or REACTOME. That goes for non-coding as well as protein-coding genes. Usually you just do your normal analysis and, for enrichment, simply ignore those with no functional annotation.
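
In practice, ignoring unannotated genes for enrichment amounts to filtering the gene list against whatever the annotation database covers. A minimal sketch, assuming a set of annotated gene names is already at hand; `partition_for_enrichment` and the example `annotated` set are hypothetical and not part of any library.

```python
def partition_for_enrichment(genes, annotated):
    """Split genes into those with and without a functional annotation, keeping input order."""
    annotated = set(annotated)
    kept = [g for g in genes if g in annotated]
    skipped = [g for g in genes if g not in annotated]
    return kept, skipped


# Hypothetical inputs: genes from the analysis and names known to the database.
markers = ["GAPDH", "AC064836.4", "ACTB", "AC007750.1"]
annotated = {"GAPDH", "ACTB", "TP53"}

kept, skipped = partition_for_enrichment(markers, annotated)
print(f"{len(kept)} genes go into enrichment, {len(skipped)} are set aside")
```

Keeping the skipped list around makes it easy to report how many genes were left out.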

I am wondering if this is "normal". I interpret your answer as saying that this is usual. My downstream analysis doesn't seem to be affected in a way that leads to unreasonable results, so I guess I am fine ignoring "those genes"?

But what is really intriguing to me is that after doing pseudotime analysis, I get a large number of genes at the latest pseudotime that are not documented in the Human Protein Atlas, for instance.

This just puzzles me: how to deal with these genes, and how to figure out what they are or whether they are useful.

ADD REPLY • link 3 months ago by Tolga ▴ 30

I see your point, but I think there is no problem. There are thousands of non-coding genes, hence they do not appear in a protein atlas. Thousands of people use hg38 and the cellranger reference. You will likely not figure out what each gene does, but the fact that they drive pseudotime suggests that they're (unsurprisingly) important. I would not worry too much about this and 'just continue'.
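
One way to check which of these genes are non-coding is to read the gene biotype from the GTF that ships with the reference. The sketch below assumes a GENCODE-style GTF whose `gene` records carry `gene_name` and `gene_type` attributes; `gene_biotypes` is a hypothetical helper, and the two example records (coordinates, source, placeholder ID and the biotype given to the accession-style name) are made up for illustration, not copied from the real annotation.

```python
import re
from collections import Counter

# Matches GTF attributes of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def gene_biotypes(gtf_lines):
    """Map gene name -> biotype using the 'gene' records of a GENCODE-style GTF."""
    biotypes = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene-level records are needed
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        name = attrs.get("gene_name")
        if name is not None:
            biotypes[name] = attrs.get("gene_type", "unknown")
    return biotypes


# Illustrative records only; not real annotation lines.
example_gtf = [
    "#description: example\n",
    'chr1\tEXAMPLE\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG00000111640"; gene_type "protein_coding"; gene_name "GAPDH";\n',
    'chr1\tEXAMPLE\tgene\t200\t300\t.\t-\t.\tgene_id "PLACEHOLDER_ID"; gene_type "lncRNA"; gene_name "AC064836.4";\n',
]
biotypes = gene_biotypes(example_gtf)
print(Counter(biotypes.values()))
```

Running the same function over the reference GTF and tallying the biotypes of the genes that drive pseudotime would show how many of them are annotated as non-coding.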

ADD REPLY • link 3 months ago by ATpoint 82k

Thanks for the insights!

Just to clarify: do you recommend ignoring the genes even if they drive pseudotime? Or shall I delve into them to find out what they do? I guess I will try to figure out their function, and if that turns out to be unsuccessful, I will "just continue" with the annotated ones.

ADD REPLY • link 3 months ago by Tolga ▴ 30