Basic SAMtools tutorial
2
3
8.1 years ago
statfa ▴ 760

Hi,

I need some help getting started with SAMtools. I am a beginner with a biostatistics background, and this is my first time using Ubuntu Linux (I installed it on my Windows machine as a VMware virtual machine, for illustration purposes). I would like to learn how to use SAMtools to carry out basic tasks such as:

  1. Importing BAM/SAM files into samtools
  2. Viewing the data
  3. Sorting bam files
  4. Calling variants

I have already searched the internet and found some tutorials such as http://www.htslib.org/doc/samtools.html, but I have no idea what those commands mean!

Could anybody please tell me where I can find a step-by-step tutorial for a beginner who doesn't even know how to read such commands?

If I am right, I should run samtools in the terminal. After running "sudo apt-get install samtools" and then typing "samtools", I think I have started samtools, but how should I write samtools commands in the terminal?

I would also be grateful if someone could tell me where I can find some small BAM files to test samtools with.

Thanks a lot!

samtools • 9.1k views
ADD COMMENT
3
8.1 years ago
Sinji ★ 3.2k

It sounds like you are missing a lot of basic Unix knowledge, and it will be difficult to progress without it. If you are struggling with basic concepts such as terminal commands and their meaning, you will need to spend time learning how to use a terminal in a Linux environment, and then spend some time reading the manuals for the different programs you intend to use.

I'm going to go ahead and encourage you to think about using Galaxy for your project. It's an open, web-based platform for a variety of data analysis tasks and, best of all, it features a GUI for those who aren't very comfortable with, or are rather new to, the terminal. Here seems to be a pretty good guide for variant calling using Galaxy, though admittedly I've never performed this type of analysis, so I cannot be entirely sure.

I also suggest that you get a local machine with a Linux distro such as Ubuntu installed (perhaps ask your PI if this is possible), and not just a Windows machine running a virtual machine. This will force you to get used to the Linux operating system and will make it easier to install bioinformatics software such as samtools and, eventually, to learn the terminal and the command line.

Here are a couple of links to get you started navigating the terminal and learning about the shell:

Codecademy

Command-Line Crash Course

Learning The Shell

Learn To Shell The Hard Way

ADD COMMENT
0

Thanks a lot for your help.

Yes, you are right that I am missing basic Unix knowledge. Today is the first day I have seen a Linux environment. But do you think the reason I cannot understand those manuals is that I don't know the basic terminal commands and their meaning? When I installed R and ran it in the terminal, I was able to work with it without any knowledge of terminal commands. The commands in the Linux version of R were the same as in its Windows version. That is why I thought learning terminal commands would not be required to run samtools and IGV either. Do you think that by learning terminal commands I will be able to understand those manuals?

For my project I need to work with samtools because I have to make sure my BAM files are sorted and, if they are not, sort them. Do you think Galaxy can help me in this case?

Thanks a lot

ADD REPLY
2

samtools and similar tools aren't environments you work inside; they're just additional commands that get installed (most tools are like this). You're going to have a much easier transition if you start by using Galaxy (downloading results and processing things further in R as needed).
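To illustrate the point that samtools is a set of ordinary commands rather than a session you type into, here is a minimal sketch of checking whether a BAM file is sorted and sorting it if not. The file name `example.bam` and the `sort_order` helper are made up for illustration; the helper only reads the `SO:` field that the file's `@HD` header line declares, which may be missing or out of date in some files.

```shell
# sort_order: hypothetical helper. Reads SAM header text on stdin and
# prints the value of the SO: field from the @HD line, if there is one.
sort_order() {
    awk -F'\t' '/^@HD/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SO:/) { print substr($i, 4); exit }
    }'
}

bam="example.bam"   # assumed file name; replace with your own

if command -v samtools >/dev/null 2>&1 && [ -f "$bam" ]; then
    # -H prints only the header; pull the declared sort order out of it.
    order=$(samtools view -H "$bam" | sort_order)
    echo "declared sort order: ${order:-not stated}"

    # Sort by coordinate only if the header does not already say so.
    if [ "$order" != "coordinate" ]; then
        samtools sort -o example.sorted.bam "$bam"
    fi
else
    echo "samtools or $bam not found; skipping"
fi
```

Each `samtools ...` line is typed at the normal shell prompt (or saved in a script), the same way `sudo apt-get install samtools` was.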

ADD REPLY
1

By the way, I forgot to ask: if Galaxy is web-based, does that mean I have to analyze my data online? It would be very difficult to upload all those large BAM files.

ADD REPLY
2

That is correct, and you probably don't want to do that if your internet bandwidth is limited.

ADD REPLY
1

Well, it's semi-correct. You can run the Galaxy web server on your own computer. Ubuntu Linux is a very general-purpose version of Linux, and people often replace Windows with it. There are, however, versions of Linux with all the bioinformatics tools already installed, including Galaxy. Probably the most famous is BioLinux, which you can download and run just like your Ubuntu (in VMware) from here: http://environmentalomics.org/bio-linux/

Once that's running, you can even communicate with Galaxy in BioLinux from within Windows if you know the IP of your virtual machine. Good luck!

ADD REPLY
1

Oh, that's very helpful... I was once recommended to use Bio-Linux, but I couldn't understand its main purpose and only meant to test it... Now, with your explanation, it is clearer to me... I'm downloading it now... Just one question: as I have been told before, BioLinux is more user-friendly, right? Does that mean it is easier to work with samtools and other tools like Galaxy there? And when creating a virtual machine in VMware to install BioLinux, how much hard disk space should I give it? For Ubuntu Linux, I have given it 30 GB of hard disk space, which I think is a lot... Thank you again

ADD REPLY
1

What I think you're asking for is a GUI for the common bioinformatics tools that makes them more intuitive to use (so that, for example, you don't have to know that -F in samtools means filter-reads-away while -f means filter-reads-to-keep). The closest thing you are going to get to a GUI is Galaxy, which aims to provide a consistent visual wrapper around all the common tools (and also other things, like running a bunch of commands in order).

Installing Galaxy can be a pain - particularly installing and integrating all the common tools into it. That is what BioLinux provides. Nothing else on top, unfortunately. Just the installation of all the software.

Furthermore, using Galaxy will only get you so far. You will still find yourself having to learn the details of the tools anyway. Even if Galaxy gives you a box called "filter reads away with flag: ", which abstracts away the -F bit that you'd have to use in the terminal, you still need to know that every read in a BAM file has a single number that represents 12 things that can each be true or false for the read (mapped, forward strand, duplicate, etc.), and that you have to add these numbers up (or use this calculator) to combine multiple flags. In other words, as nice as Galaxy is, there's little you can do without knowing what each tool integrated into it does and how those tools work. This is, after all, why Bioinformatician is a job and not a button on some website :P

Buuuut, the people on Biostars are super-friendly and will help you if you get stuck - so if it seems like an impossible task at times, don't worry! We all went through it at some point :)
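As a rough sketch of the flag arithmetic described in this reply, the shell snippet below adds two flag values together and tests whether a given bit is set. The bit values (4 for unmapped, 16 for reverse strand, 1024 for duplicate) come from the SAM format; the variable names and the `has_flag` helper are made up for illustration, and `in.bam` in the comments is a placeholder file name.

```shell
# A few SAM flag bits (values defined by the SAM format).
UNMAPPED=4
REVERSE=16
DUPLICATE=1024

# Combining flags means adding their values (same as a bitwise OR,
# because each flag is a distinct power of two).
combined=$((UNMAPPED + DUPLICATE))
echo "$combined"

# has_flag FLAG BIT: hypothetical helper; succeeds if BIT is set in FLAG.
has_flag() {
    [ $(( $1 & $2 )) -ne 0 ]
}

# A read with flag 1040 (1024 + 16) is a duplicate on the reverse strand.
if has_flag 1040 "$REVERSE"; then
    echo "flag 1040 includes the reverse-strand bit"
fi

# How the numbers are used with samtools (in.bam is a placeholder):
#   samtools view -f 4 in.bam      # keep only reads that ARE unmapped
#   samtools view -F 1028 in.bam   # drop reads that are unmapped or duplicates
```

Recent samtools releases also include a `samtools flags` subcommand that translates between flag numbers and names, which may be easier than doing the arithmetic by hand.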

ADD REPLY
0

Thank you :) ... Your explanation was comprehensive... I don't want to go deep into bioinformatics... My previous post, C: NGS data analysis, describes what I am going to do... I'm installing BioLinux and want to start navigating the terminal and learning about the shell using the links provided above... Thanks again

ADD REPLY
0

Aha... So samtools is a set of commands, not an environment like R... I understand it now... Thank you... I will start using Galaxy to see how it is... I hope it works

ADD REPLY
2
8.1 years ago
GenoMax 141k

Spend some time here learning the basics of UNIX.
You can also take a look at the online materials for an NGS data analysis course, which can be found here. They will give you an overall idea of the steps you need to learn, along with the materials for each.
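For orientation only, the steps asked about in the question (sort, index, call variants) might be strung together roughly as below. This is a sketch under assumptions, not a recommended pipeline: it assumes bcftools (a separate program from samtools) is installed, the function names `sorted_name` and `call_variants` are hypothetical, and the file names in the example call are placeholders.

```shell
# sorted_name FILE.bam: hypothetical helper; prints FILE.sorted.bam.
sorted_name() {
    printf '%s\n' "${1%.bam}.sorted.bam"
}

# call_variants BAM REF OUT: hypothetical wrapper around the usual steps.
#   BAM  input alignments
#   REF  reference genome in FASTA format
#   OUT  output VCF file
call_variants() {
    bam=$1
    ref=$2
    out=$3
    sorted=$(sorted_name "$bam")

    # 1. Sort alignments by coordinate.
    samtools sort -o "$sorted" "$bam" &&
    # 2. Index the sorted file so tools and viewers can jump around in it.
    samtools index "$sorted" &&
    # 3. Pile up reads against the reference and call variants (-m: caller
    #    model, -v: output variant sites only).
    bcftools mpileup -f "$ref" "$sorted" | bcftools call -mv -o "$out"
}

# Example call (placeholder file names):
#   call_variants sample.bam reference.fa calls.vcf
```

Defining the function runs nothing by itself; the work starts only when `call_variants` is called with real files.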

ADD COMMENT
0

Ok sure... I think learning UNIX is unavoidable... Thanks a lot for your guidance...

ADD REPLY
