Question

How can I normalize my reads?

0

Entering edit mode

4.9 years ago

apl00028 ▴ 90

I got a sam file after do an alignment (of my NGS data), using this function:

bowtie2 -f -x scaffold_filt_PV014_OD_DB.fa PV002_cmv3.fa -S scaffold_nofilt_OD_PV002.sam

After it, I change to bam file :

samtools view -bS scaffold_nofilt_OD_PV002.sam > scaffold_nofilt_OD_PV002.bam

I removed the unpaired reads:

samtools view -F 0x04 -b scaffold_nofilt_OD_PV002.bam >scaffold_nofilt_OD_rm_PV002.bam

I select those with a map quality higher than 10

samtools view -q 10 -b scaffold_nofilt_OD_rm_PV002.bam > scaffold_nofilt_OD_MapQ10_PV002.bam

I sort and I index it:

samtools sort scaffold_nofilt_OD_MapQ10_PV002.bam scaffold_nofilt_OD_PV002_index
samtools index scaffold_nofilt_OD_PV002_index.bam

When I am visualizing these bam files I got different coverages among my libraries. I would like normalize this coverage among them, but I don't know how to do that.

Can you guide me how can I normalize them?

Thank in advantage.

RNA-Seq alignment • 3.8k views

ADD COMMENT • link updated 4.9 years ago by GenoMax 141k • written 4.9 years ago by apl00028 ▴ 90

0

Entering edit mode

samtools view -F 0x04 -b scaffold_nofilt_OD_PV002.bam >scaffold_nofilt_OD_PV002.bam

In unix, remember that redirections are among the first operations executed, which means >scaffold_nofilt_OD_PV002.bam is executed first, obliterating the content of the file scaffold_nofilt_OD_PV002.bam. Your code should stop working at that step unless what you've pasted here is not what you ran.

ADD REPLY • link 4.9 years ago by Ram 43k

0

Entering edit mode

well, I run them using different directories for input files and output files

ADD REPLY • link 4.9 years ago by apl00028 ▴ 90

0

Entering edit mode

Should have made that clear in your command as the command would not run as-is.

ADD REPLY • link 4.9 years ago by Ram 43k

0

Entering edit mode

What kind of data are these? Normalization for which purpose (visualisation, some differential analysis, if so which)?

ADD REPLY • link 4.9 years ago by ATpoint 81k

0

Entering edit mode

My data comes frome NGS data which I aligned against a database. I would like normalize it for find the polymorphisms, due to I have different coverages in each library.

ADD REPLY • link 4.9 years ago by apl00028 ▴ 90

2

Entering edit mode

Data from NGS is like saying a car comes from a factory but does not tell the brand at all, please be more specific so that we can help you. Is this exome sequencing or something similar that is actually intended to find polymorphisms or is it something like RNA-seq that you now try to tweak in order to find them? Given that the files are called scaffold... I guess it is whole genome or exome sequencing. If so, there is no need for normalization. Use any of the established variant callers such as GATK, Strelka etc (use the search function please for more suggestions) to find variants. The tools will ahndle normalization internally.

ADD REPLY • link 4.9 years ago by ATpoint 81k

0

Entering edit mode

This is RNA-seq, and I am trying to modify them.

ADD REPLY • link 4.9 years ago by apl00028 ▴ 90

1

Entering edit mode

I don’t understand how coverage (quantitative data) relates to polymorphisms (qualitative data). Why do you need to normalize? What do you hope to achieve?

The concepts "RNAseq", "normalization" and "polymorphisms" don't make sense together. Please explain what you mean by normalization, how it is useful in this scenario and why you're looking for polymorphisms using RNAseq in the first place (taking you further away from changes at the DNA level and being unable to account for any changes at splice sites / RNA modifications)

ADD REPLY • link 4.9 years ago by Ram 43k

score 4 · Accepted Answer · 2019-05-20

4

Entering edit mode

4.9 years ago

GenoMax 141k

To answer your question theoretically you can use bbnorm.sh from BBMap suite to normalize your data. You can find a guide here.

Please consider the section about When Not To Use BBNorm which clearly states that:

Never normalize read data prior to a quantitative analysis (like ChIP-seq, RNA-seq for expression profiling, etc)

You should also consider this.

Also, do not normalize data prior to mapping for variant discovery; it will cause bias. If you need to reduce data volume in any of these scenarios, use subsampling rather than normalization.

You can sub-sample data using reformat.sh from the BBMap suite.

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.

ADD COMMENT • link 4.9 years ago by GenoMax 141k

0

Entering edit mode

Thanks for your answer, I have analyzed it and I am in the second scenario. I want normalize the number of reads mapping variant discovery. So. should do I use subsampling, could you guide to me?

ADD REPLY • link 4.9 years ago by apl00028 ▴ 90

0

Entering edit mode

What do you understand from the documentation that genomax posted above? Can you please be more specific in your questions - people will be more willing to help you if they see you put in some effort to solve things on your own.

ADD REPLY • link 4.9 years ago by Ram 43k

0

Entering edit mode

Dear genomax, I think that this tool (BBMap) is the most appropiate for my goal. Do you know any article where it is describe for reduce the variation of NGS reads depth along the reference genome to describe polymorphism variations? Thanks in advantage

ADD REPLY • link 4.9 years ago by apl00028 ▴ 90