This might be a newbie question. I'm a QM chemist stepping in for a bioinformatician at work, so I apologise in advance if I leave out information needed to answer it.
I have a documented set of steps to follow that allow me to align my paired-end reads to a human reference genome and then perform variant calling. I am using Samtools, GATK and Picard. I am also using the same reference genome fa file as my colleague.
However, when I perform the variant calling and I look inside the sam file I generated, I only have reference names "ref|NT_.....|". The original files generated by the bioinformatician have "NC...." as reference names. The code further down the pipeline will require the NC... naming structure.
I don't want this question to feel too much like a black box, but if I could get a general idea of what will have caused the difference in read reference names, I would be really grateful and I can go from there.
Thanks
You must have aligned your data to a reference collection that had ref|NT.. names instead of the NC names.
Would this be something inside the human genome *.fa file?
Yes. Take a look at
grep "^>" .fa
and see if that is what you have.
You need to use matched genome sequence/annotation for this reason.
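If the grep output is too long to scan by eye, the same header check can be summarised in a few lines of Python. This is only a minimal sketch: the function name header_prefixes and the demo headers are made up for illustration, and it assumes the aligner uses the first whitespace-delimited token of each ">" line as the reference name, which is the usual behaviour but worth confirming for your tools.

```python
import io
from collections import Counter


def header_prefixes(handle):
    """Count reference-name prefixes (text before the first '_') in FASTA headers."""
    counts = Counter()
    for line in handle:
        if not line.startswith(">"):
            continue  # skip sequence lines
        tokens = line[1:].split()
        if not tokens:
            continue  # bare '>' with no name
        name = tokens[0]  # assumed to be the name the aligner writes to the SAM file
        counts[name.split("_")[0]] += 1
    return counts


# Illustrative headers only, not taken from any real download.
demo = io.StringIO(
    ">NC_000001.11 example chromosome\nACGT\n"
    ">ref|NT_000001.1| example scaffold\nACGT\n"
    ">ref|NW_000002.1| example scaffold\nAC\n"
)
print(header_prefixes(demo))
```

On a real file you would pass an open file handle instead of the demo string, for example with open("your_reference.fa") as fh, where the path is a placeholder. A file whose counts show only ref|NT and ref|NW and no NC would explain the names seen in the SAM file.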
Ah I see! Thanks. I didn't spot a single NC notation. The file I have is hg38_GRCh38.p12.allChr.fa - any idea where to get hold of the corresponding reference file with NC rather than ref|NT/NW? I appreciate your help, I'm a little out of my skill set and comfort zone at the moment.
Do you know where you got that file from? You can get matching reference and GTF files from this page. You will need to realign your data though.
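After realigning, one way to confirm that the new SAM file carries the expected names is to read the SN field of its @SQ header lines, which is where the SAM format records each reference sequence name. The sketch below is hedged accordingly: sam_reference_names is a hypothetical helper, the demo header is illustrative, and the "NC_" test is just the naming convention mentioned in this thread.

```python
import io


def sam_reference_names(handle):
    """Return the SN (sequence name) value from every @SQ line of a SAM header."""
    names = []
    for line in handle:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment record
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


# Illustrative SAM header, not from a real alignment.
demo = io.StringIO(
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:NC_000001.11\tLN:1000\n"
    "@SQ\tSN:ref|NT_000001.1|\tLN:500\n"
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)
names = sam_reference_names(demo)
unexpected = [n for n in names if not n.startswith("NC_")]
print("reference names:", names)
print("names without the NC_ prefix:", unexpected)
```

An empty second list would suggest the realignment used a reference with the NC naming that the downstream code needs.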
Here is an informative blog post that you will find useful about which human reference to use.
I am afraid I don't recall where that file on my system came from. Thank you for that information, I'll try and plod on from here.