Biostar Beta. Not for public use.
Skip orientation as there are not enough pairs by bwa mem
0
Entering edit mode
15 months ago
seta ♦ 1.2k
Sweden

Hi all,

I extracted the specific region from a BAM file generated by aligning fastq (Illumina/whole genome sequencing) on hg38 and converted to corresponding fastq reads of that region by Picard (SamTofastq) and try to realign to another (my own) reference sequence by bwa mem. But, it seems that bwa mem cannot understand the sequencing is as paired end since it frequently says “skip orientation FF as there are not enough pairs” during the mapping, it also happened for other orientations (FR, RF, and RR); however, there is 1259352 paired read. Subsequently, the properly paired percentage is very low. Could you please kindly help me out how I can solve the problem?

Edit and update: I also found that during the conversion of bam to fastq by SamTofastq (picard), it returned me: SAM validation error: ERROR: Found 13548 unpaired mates. I think, it may not be a real problem; actually, it seems that Picard just consider paired read to creat Fastq file. So, the real problem is with bwa mem, yes? Please kindly share your idea with me?

Here is the used commands:

samtools sort file.bam

sammtools index file.bam file.bam.bai

Extracted the region of interest:
samtools view -h -b file.bam  chr6:28,510,120-33,480,577 > extracted_new.bam

Create fastq
java -jar SamToFastq.jar I=extracted_new.bam F=file_1.fastq F2=file_2.fastq   


Alignment by BWA-MEM 

bwa index my_reference.fasta
bwa mem -L 10000 -a my_reference.fasta file_1.fastq F2=file_2.fastq > output.sam

Many thanks

ADD COMMENTlink
0
Entering edit mode

Show us the headers of files you extracted from original alignment. They likely have lost read origination information.

While trivial this error has come up in the past when people try to use the same file (e.g. R1/R1 instead of R1/R2) when doing the second alignment. Something to check on.

ADD REPLYlink
0
Entering edit mode

Sorry, file is on the cluster and can't transfer it. Could you please tell me where is the read orientation information in the BAM header, what I should look for? I did alignment with two fastq file R1/R2.

ADD REPLYlink
0
Entering edit mode

What does zcat filename_R1.fq.gz | head -4 and zcat filename_R2.fq.gz | head -4 show? If your files are uncompressed then replace zcat with cat in the command.

ADD REPLYlink
0
Entering edit mode

The read name in two fastq files generated by SamTofastq is the same except for /1 and /2. Also, the read count of two fastq file is the same. So, the problem is another place. In addition, someone says "bwa expects the first read of file one to be the mate of the first read of file 2. That assumption obviously does not hold throughout the whole file, because sorting by coordinates and throwing away reads ruins that" in this post. What do you think?

ADD REPLYlink
0
Entering edit mode

You should post the exact commands you used to extract the reads from the bam file and subsequently to map the fastq files again

You should also do your best to provide the information requested when someone asks for additional information.

ADD REPLYlink
0
Entering edit mode

I added the used commands to the original post.

ADD REPLYlink
0
Entering edit mode

Either the original alignment or your conversion from samtofastq eliminated the identifiers that show which of the reads is from R1 and which is from R2. Since you assure us that reads from R1/R2 are in sync, you can fix that issue by using following tool from BBMap suite:

reformat.sh in1=R1.fq.gz in2=R2.fq.gz out1=fixed.R1.fq.gz out2=fixed.R2.fq.gz

add one of the two options below. Either should work.

addslash=f              Append ' /1' and ' /2' to read names, if not already present. 
addcolon=f              Append ' 1:' and ' 2:' to read names, if not already present.
ADD REPLYlink
0
Entering edit mode

Hi genomax.

Here

is the output of head -4 of two fastq files (sorry if the resolution is not enough high); as you can see read name in both fastq files is the same and are along with /1 and /2, respectively; so the read name and their identifier (/1 and /2) is correct. Is it possible the read name or the related identifiers become wrong in the whole fastq files? do you still suggest using repair.sh?

Thank you

ADD REPLYlink
0
Entering edit mode

My apologies. I meant to say reformat.sh instead of repair.sh (fixed now).

BTW: We don't see your fastq examples. You can copy and paste the lines (don't post as an image) and then format using the code button.


code_formatting

Your reads do have /1 and /2. So it is possible that there just are not enough reads for bwa to do that estimate.

ADD REPLYlink
0
Entering edit mode

Yes, the reads have /1 and /2 based on the head of two fastq files. There is 1259352 paired read, it isn't enough for bwa mem in your opinion? I also tested bowtie2, but the overall alignment rate was very bad (about 5.4%). With bwa mem, the mapping percentage is about 95%, but the properly paired is very low (about 10%). Please kindly share me any suggestions and ideas?

ADD REPLYlink
0
Entering edit mode

Are you getting that error/warning with a) original data AND b) extracted reads (~13K) for a specific region. I thought it was only the latter and thus there may not be enough data.

Why are you cherry picking data like this BTW? Is your own reference just for this smaller region? If it is for the whole genome why can't you use the full data to align?

/1 and /2 have not been used for a while. Is this old data?

ADD REPLYlink
0
Entering edit mode

Hi genomax,

Thank you for following the issue. Regarding your questions, I get the error/warning with extracted reads; there is more than 1M PE reads (not 13K) in the extracted bam file as I mentioned in my previous comment. Assuming Bwa mem may predict the wrong insert size distribution, I used CollectInsertSizeMetrics (from picard) to predict the insert size distribution in the extracted bam file; however, I’m not sure how these numbers should be feed to bwa mem via I flag. Could you please kindly show me an example?

2) yes, my own reference is related to just this small region. Actually, this region is HLA and I would like to visualize the reported alleles from HLA typing (of whole genome sequencing data) in IGV. So, I extracted this region, converted to fastq files and try to re-align to hla genomic sequences as reference. It will be great if you please share me any idea/suggestion on this issue.

Thanks

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3