Question

Changing Chromosome Notation in Bam Files to Include Sample ID

0

6.2 years ago

dthorbur ★ 1.9k

I've found various ways to change the notation of the chromosomes in my bam files. However, would it be a bad idea to add my sample identifier to the chromosome notation? For example, changing chrX to chrX_US1 for the first US sample. I have a large data set and I'm going to be running analyses per chromosome, so I'm worried once I start I won't be able to determine which chromosome came from where.

Prior to making my sorted consensus sequences, all the samples were mapped to the same reference genome, so they shouldn't need to be realigned. Instead, I'm just going to move all of those I'm comparing into the same .fasta file.

I am very new to this, so could be making huge mistakes, hence asking on here.

Thanks in advance.

alignment bam samtools • 1.7k views

ADD COMMENT • link updated 6.2 years ago by Noushin N ▴ 600 • written 6.2 years ago by dthorbur ★ 1.9k

2

You should consider using read-groups instead of changing reference names.

ADD REPLY • link 6.2 years ago by GenoMax 141k
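
To illustrate the read-group idea, here is a minimal sketch in plain Python that works on SAM text lines. It assumes the sample ID (e.g. `US1`) doubles as both the read-group `ID` and the sample name `SM`; the function names `make_rg_line` and `tag_read` are hypothetical and not part of any library. The point is that the sample is recorded in an `@RG` header line and an `RG:Z:` tag, while the reference name (`chrX`) stays untouched.

```python
def make_rg_line(sample_id, library=None, platform=None):
    """Build a SAM @RG header line (assumption: ID and SM are both the sample ID)."""
    fields = ["@RG", f"ID:{sample_id}", f"SM:{sample_id}"]
    if library:
        fields.append(f"LB:{library}")
    if platform:
        fields.append(f"PL:{platform}")
    return "\t".join(fields)


def tag_read(sam_line, sample_id):
    """Attach an RG:Z tag to one SAM alignment line; header lines pass through."""
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line
    fields = line.split("\t")
    # The first 11 columns are mandatory SAM fields; optional tags follow them.
    core, tags = fields[:11], fields[11:]
    # Drop any existing read-group tag so the read ends up with exactly one.
    tags = [t for t in tags if not t.startswith("RG:Z:")]
    tags.append(f"RG:Z:{sample_id}")
    return "\t".join(core + tags)


read = "r1\t0\tchrX\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF"
print(make_rg_line("US1"))
print(tag_read(read, "US1"))
```

On real bam files you would normally let an existing tool do this rather than editing text, for example `samtools addreplacerg` or Picard's `AddOrReplaceReadGroups`; check their documentation for the exact options.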

0

Thanks. That looks like it could be quite promising. However, I can't seem to find out whether the read-group information is retained when converting from .bam to .fasta format. Will this information be retained?

ADD REPLY • link 6.2 years ago by dthorbur ★ 1.9k

0

If you split the bam files into read-group-specific chunks, then you can indirectly retain that information. It would not be directly transferred to the fasta files. You will need to edit the fasta headers after the fact to include the sample name.

ADD REPLY • link 6.2 years ago by GenoMax 141k
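
Renaming the fasta headers after the fact can be sketched in a few lines of plain Python. This is only an illustration: the function name `add_sample_to_headers` and the `chrX_US1` naming pattern (taken from the question) are assumptions, and anything after the first space in a header is treated as a free-text description and kept as-is.

```python
def add_sample_to_headers(fasta_text, sample_id, sep="_"):
    """Append a sample ID to the sequence name in every FASTA header line."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Split the header into the sequence name and an optional description.
            name, _, description = line[1:].partition(" ")
            header = f">{name}{sep}{sample_id}"
            if description:
                header += " " + description
            out.append(header)
        else:
            # Sequence lines are passed through unchanged.
            out.append(line)
    return "\n".join(out) + "\n"


consensus = ">chrX\nACGT\n>chrY some description\nTTGA\n"
print(add_sample_to_headers(consensus, "US1"), end="")
```

Because only the consensus fasta is touched, the bam files keep their standard chromosome names for any other downstream tools.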

0

You wrote

I have a large data set and I'm going to be running analyses per chromosome,

and

I can't seem to find out if the read-group information is retained when converting from .bam to .fasta format.

but since we have no idea what you are trying to accomplish, we can't really give you a good answer. In general, though, I agree with Noushin N that this sounds like a bad idea. I can't imagine a scenario in which this would be the best solution. In bam files, read groups solve your problem, but you want (for unclear reasons) to keep that information in fasta files. Also, when converting the bam file back to fasta, you lose the mapping location information.

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Apologies for the lack of information. Ultimately I am going to be converting these files into .phylip format to use in the program VariScan. Unfortunately, there don't appear to be any direct conversions, so I have to convert the file into .fasta first.

For VariScan I need anywhere from 12 to 66 samples aligned by chromosome in one .phylip file. Thus, I need a way of identifying which sample is which. In my current workflow, I don't see a way to retain sample identity once they are in the same file.

I hope that clears up my aims and intentions.

ADD REPLY • link 6.2 years ago by dthorbur ★ 1.9k

0

Not sure if this will help you: BamBam. May be worth a look.

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

Thanks for your suggestions. I've finally come up with a solution, and it was so much easier than I had anticipated. You can directly edit the header of the consensus fasta sequence, which is apparently retained when converting to .phylip. All you need to do is keep a spreadsheet with the information usually kept in the fasta header.

ADD REPLY • link 6.2 years ago by dthorbur ★ 1.9k
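
As a rough sketch of the last step described above, the following plain-Python code combines per-sample consensus sequences for one chromosome into a single sequential PHYLIP block, using the fasta header names as taxon labels. The function names `read_fasta` and `to_relaxed_phylip` are hypothetical. It writes "relaxed" PHYLIP (names of any length, separated from the sequence by spaces); strict PHYLIP limits names to 10 characters, and whether VariScan accepts longer names is an assumption that should be checked against its documentation.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, using the first word of each header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def to_relaxed_phylip(records):
    """Format equal-length sequences as sequential, relaxed PHYLIP text."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    # First line: number of sequences, then number of sites.
    lines = [f"{len(records)} {n_sites}"]
    # Pad names to a common width so the sequences line up in columns.
    width = max(len(name) for name in records) + 2
    for name, seq in records.items():
        lines.append(f"{name.ljust(width)}{seq}")
    return "\n".join(lines) + "\n"


combined = ">chrX_US1\nACGT\nAC\n>chrX_US2\nACGTAA\n"
print(to_relaxed_phylip(read_fasta(combined)), end="")
```

The equal-length check matters because PHYLIP describes an alignment: consensus sequences built against the same reference should have matching lengths per chromosome, and a mismatch is worth investigating before running the analysis.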

score 1 · Answer 1 · 2018-02-05

This doesn't sound like a good idea. I realize you mention that the reads have already been aligned to a common reference, but alignment is typically just the first step in the analysis pipeline. Renaming chromosomes to non-standard names will likely cause errors and/or inaccuracies in many downstream steps, such as annotation and variant calling.