I try to follow this link(GDC protocol of processing TCGA data) https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/#dna-seq-alignment-command-line-parameters to process some cancer exome sequencing file. I used TCGA bam file as a test. I first convert bam to fq, and run following command:
bwa mem -t 8 -T 0 -R '@RG\tID:'${sampleID}'\tSM:'${sampleID} -M $reference $readsFiles_split |samtools view -Shb -o "$sampleID"/step1.output.bam -
java -Xmx4g -jar ${binDir%/}/picard.jar SortSam CREATE_INDEX=true INPUT="$sampleID"/step1.output.bam OUTPUT="$sampleID"/step2.output.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=STRICT
java -Xmx4g -jar ${binDir%/}/picard.jar MergeSamFiles ASSUME_SORTED=false CREATE_INDEX=true INPUT="$sampleID"/step2.output.bam MERGE_SEQUENCE_DICTIONARIES=false OUTPUT="$sampleID"/step3.output.bam SORT_ORDER=coordinate USE_THREADING=true VALIDATION_STRINGENCY=STRICT
java -Xmx4g -jar ${binDir%/}/picard.jar MarkDuplicates CREATE_INDEX=true INPUT="$sampleID"/step3.output.bam VALIDATION_STRINGENCY=STRICT O="$sampleID"/step4.output.bam M="$sampleID"/step4.output.txt
However, I recieved error message in MarkDuplicates step saying :
Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 6: test_tcga:HWI-EAS289_106503224:7:3:14221:11655
I used -M already. And I used TCGA bam file to convert to fq. Anyone has any idea what's going on? I am not quite familar with those steps. Thanks!
what is the version of picard ? try with VALIDATION_STRINGENCY=LENIENT
Picard 2.6.0 -SNAPSHOT I will try that Any other possibilities? Thanks!
Sorry I just found that read actually shows twice in fq file....
so 1.fq:
in 2.fq
here are my bam2fastq step commands...anything wrong :
so the fastq is wrong... there's nothing to do but removing those duplicates.