Question

How Can I Get This Information About BAM Files?

6

13.4 years ago

Biomed 5.0k

I have three bam files, each bam file contains data from a sequencing lane. These three lanes represent the whole exome sequence of a single patient. If I had not known that these three bam files belong to the same sequencing run, is there a way to figure out that these files are from the same study and from different lanes?

Another way of asking the same question: suppose I were given only two of these files. How would I figure out that the third one is missing?

How can I understand if the bam files contained aligned reads or unaligned reads?
Do I need to merge these before I do any analysis like aligning, variant calling etc?

Thank you

bam next-gen sequencing • 13k views

ADD COMMENT • link updated 7.9 years ago by Biostar 20 • written 13.4 years ago by Biomed 5.0k

0

For 3 - BAM files are (generally) post-alignment

ADD REPLY • link 13.4 years ago by Aaron Statham ★ 1.1k

0

Thanks, but to make sure that they are aligned do I need to convert to SAM and check the flags?

ADD REPLY • link 13.4 years ago by Biomed 5.0k

score 7 · Answer 1 · 2010-12-10

"These three lanes represent the whole exome sequence of a single patient. If I had not known that these three bam files belong to the same sequencing run, is there a way to figure out that these files are from the same study and from different lanes?"

The short answer to your first question is that there is nothing intrinsic to BAM format data that lets you derive this information with certainty. Your rephrasing actually asks a slightly different, but related and equally important, question. The answer to both is that you can only hope that the data providers followed a good scientific record-keeping regime, such as MINSEQE, outside of the BAM files.

BAM file headers are not sufficiently structured to represent an experimental design. The headers may contain "read group" records which, if present, must contain a "sample name". What is a valid "sample name" is not specified. If your 3 lanes are a single sample split into 3 lanes at the point of loading onto the flowcell(s), then they will probably have the same "sample name". There are also optional "library" and "description" fields that may be present in a "read group" record, which may tell you something. Also the sequencing platform (e.g. Illumina) and platform unit (e.g. lane) fields may tell you something, as might the date of sequencing.

Unfortunately, most BAM headers are optional and IMO their fields are too vaguely defined to be very useful. They are particularly difficult to use computationally, effectively being free text.
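If read group records are present, the fields mentioned above can be pulled out of the header for a quick side-by-side comparison of the files. The following is only a minimal sketch: rg_fields is a hypothetical helper name, it assumes the header text arrives on stdin (for example from samtools view -H), and it cannot tell you anything the data provider did not record.

```shell
# Print selected @RG tags, one read group per line, from SAM header text on stdin.
# rg_fields is a hypothetical helper name; tags that are absent are shown as "-".
rg_fields() {
    awk -F '\t' '
        $1 == "@RG" {
            # Reset the per-line tag table, then collect TAG:value pairs.
            split("", tag)
            for (i = 2; i <= NF; i++) {
                tag[substr($i, 1, 2)] = substr($i, 4)
            }
            # ID, sample, library, platform unit, platform, run date.
            n = split("ID SM LB PU PL DT", want, " ")
            line = ""
            for (j = 1; j <= n; j++) {
                v = (want[j] in tag) ? tag[want[j]] : "-"
                line = line (j > 1 ? " " : "") want[j] "=" v
            }
            print line
        }
    '
}
```

Possible usage, with lane1.bam as a placeholder file name:

samtools view -H lane1.bam | rg_fields

If the three files report the same SM value but different PU values, that is at least consistent with one sample spread over several lanes. It is not proof, and it says nothing about whether a further lane is missing.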

"How can I understand if the bam files contained aligned reads or unaligned reads?"

They may contain both. Each alignment record contains a flag field, which is an integer interpreted in its binary representation, with each bit having a different meaning. One bit (0x4) indicates that the query read is unmapped and another (0x8) indicates that its mate is unmapped (PacBio reads will cause problems!). You will need to scan the file to count the different flags.

Some sequencing centres use BAM files for all unaligned reads because they contain a superset of the data found in Fastq files.
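The flag scan described above can be sketched as a small filter. This is an illustration rather than a definitive tool: count_mapped is a hypothetical helper name, and it assumes SAM-formatted alignment lines on stdin with the flag in column 2. In the SAM flag field, bit 0x4 (decimal 4) being set means the read is unmapped.

```shell
# Tally mapped vs. unmapped records from SAM alignment lines on stdin.
# count_mapped is a hypothetical helper name.
# Column 2 is the FLAG; bit 0x4 (decimal 4) set means the read is unmapped.
count_mapped() {
    awk -F '\t' '
        /^@/ { next }    # skip header lines if they were included
        {
            # Test bit 4 with integer arithmetic, since not every awk has bitwise operators.
            if (int($2 / 4) % 2 == 1) unmapped++
            else mapped++
        }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    '
}
```

Possible usage, with mydata.bam as a placeholder file name:

samtools view mydata.bam | count_mapped

samtools can also do the counting itself: samtools view -c -f 4 mydata.bam counts records with the unmapped bit set, and samtools view -c -F 4 mydata.bam counts records without it.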

"Do I need to merge these before I do any analysis like aligning, variant calling etc?"

Not necessarily. In fact, we often split BAM files into many parts to speed alignment by mapping them in parallel where appropriate e.g. when using BWA. Then we might merge them afterwards. It depends on the software you are using.

Ram · Answer 2 · 2010-12-09

3

13.4 years ago

Pierre Lindenbaum 161k

?
export BAM to SAM using samtools view and check the flag (see http://picard.sourceforge.net/explain-flags.html)
use samtools merge

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.4 years ago by Pierre Lindenbaum 161k

0

I edited the question to the best of my ability. I hope it is more clear now.

ADD REPLY • link 13.4 years ago by Biomed 5.0k

0

Pierre, thanks for your answer. I understand that it is not possible to do this without converting to sam and looking at the flags. Also I assume I have to merge the files into a single bam for all downstream analysis.

ADD REPLY • link 13.4 years ago by Biomed 5.0k

1

samtools view mydata.bam  | head -n 50

will show you the first 50 lines of your .bam, after the header.

samtools view -h mydata.bam  | head -n 100

will show you the first 100 lines including the header. So you don't have to convert the whole thing to sam.

Or

samtools view mydata.bam | cut -f 2 | sort | uniq -c | sort -nr

will tell you all the flags present, and how many times each one is seen.
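The flag values in that tally are plain integers, so it can help to decode them into the bits they contain. Here is a rough sketch, assuming a POSIX shell: explain_flag is a hypothetical helper name, and the labels are my own shorthand for the bit meanings in the SAM specification, not official names.

```shell
# Print the names of the bits set in one SAM FLAG value.
# explain_flag is a hypothetical helper; the labels paraphrase the SAM specification.
explain_flag() {
    flag=$1
    out=""
    for pair in \
        1:paired 2:proper_pair 4:read_unmapped 8:mate_unmapped \
        16:read_reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair \
        256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
    do
        bit=${pair%%:*}     # number before the colon
        name=${pair#*:}     # label after the colon
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    echo "${out:-none}"
}
```

For example, explain_flag 99 prints paired,proper_pair,mate_reverse,first_in_pair, and explain_flag 4 prints read_unmapped. To annotate the whole tally:

samtools view mydata.bam | cut -f 2 | sort | uniq -c | sort -nr | while read -r n f; do echo "$n $f $(explain_flag "$f")"; done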

ADD REPLY • link 7.9 years ago by swbarnes2 14k

0

No, you can work directly on the BAM file e.g. with Picard (see http://picard.sourceforge.net)

ADD REPLY • link 13.4 years ago by biobot 0.0.77.a.1099 6.2k