mm10: Sequence length difference between interval file and reference
6.1 years ago
NB ▴ 960

Hello, I have mouse exome data, for which I downloaded the reference from UCSC. I am trying to create an interval file from the Agilent SureSelect BED file.

This is the command I am using

 java -jar  picard.jar BedToIntervalList I=S0276129_Covered.bed O=S0276129_Covered.interval_list SD=mm10_genome.dict

This is the error I get

chr1 was past the end: 195471971 < 196469947
chr5 was past the end 151834684 < 151842168
chr7 was past the end: 145441459 < 145451439
chr8 was past the end: 129401213 < 129458847
chr12 was past the end: 120129022 < 120129244
chr14 was past the end: 124902244 < 125075837
chr16 was past the end: 98207768 < 98218510
chr17 was past the end: 94987271 < 95126542
chr18 was past the end: 90702639 < 90702728

Any idea why the sequence lengths differ between the dictionary file and the interval file, and how I can correct this?

Many thanks,

mm10 sequence length ucsc mouse reference • 1.4k views
6.1 years ago

The error is raised here: https://github.com/broadinstitute/picard/blob/master/src/main/java/picard/util/BedToIntervalList.java#L158

Maybe you're using the wrong mm10_genome.dict, or S0276129_Covered.bed contains coordinates that overflow the lengths in the 'dict' file.
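To see which of the two it is, one option is to list every BED record that does not fit the dictionary. This is only a sketch: it assumes the .dict is the SAM-header text file written by CreateSequenceDictionary (tab-separated @SQ lines with SN: and LN: tags), and `report_bed_overflow` is a made-up helper name, not a Picard tool.

```shell
# report_bed_overflow is a hypothetical helper name, not a Picard tool.
# Usage: report_bed_overflow genome.dict targets.bed
report_bed_overflow() {
  awk '
    BEGIN { FS = "\t" }
    # First file: collect contig lengths from the @SQ lines of the .dict.
    NR == FNR {
      if ($1 == "@SQ") {
        name = ""; len = 0
        for (i = 2; i <= NF; i++) {
          if ($i ~ /^SN:/) name = substr($i, 4)
          else if ($i ~ /^LN:/) len = substr($i, 4) + 0
        }
        seqlen[name] = len
      }
      next
    }
    # Second file: skip BED header lines, then check each record.
    /^(track|browser|#)/ { next }
    !($1 in seqlen) { print "not in dictionary: " $0; next }
    $3 + 0 > seqlen[$1] { print "past end of " $1 " (length " seqlen[$1] "): " $0 }
  ' "$1" "$2"
}
```

Something like `report_bed_overflow mm10_genome.dict S0276129_Covered.bed | wc -l` then tells you whether it is a handful of records or a large share of the file.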

Use awk to remove the bad lines?

e.g:

awk '(($1=="chr1" && int($3) <= 195471971) || ($1=="chr5" && int($3) <= 151834684))' in.bed > out.bed
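The same idea can be written without hard-coding each chromosome, by reading the lengths from the dictionary itself. A minimal sketch, assuming the .dict is the usual SAM-header text file (tab-separated @SQ lines with SN: and LN: tags); `drop_bed_overflow` is a hypothetical name used only for illustration.

```shell
# drop_bed_overflow is a hypothetical helper name, not a Picard tool.
# Usage: drop_bed_overflow genome.dict in.bed > out.bed
drop_bed_overflow() {
  awk '
    BEGIN { FS = "\t" }
    # First file: contig name -> length, taken from the @SQ lines.
    NR == FNR {
      if ($1 == "@SQ") {
        name = ""; len = 0
        for (i = 2; i <= NF; i++) {
          if ($i ~ /^SN:/) name = substr($i, 4)
          else if ($i ~ /^LN:/) len = substr($i, 4) + 0
        }
        seqlen[name] = len
      }
      next
    }
    # Second file: pass BED header lines through unchanged.
    /^(track|browser|#)/ { print; next }
    # Keep a record only if its contig is known and its end fits.
    ($1 in seqlen) && $3 + 0 <= seqlen[$1]
  ' "$1" "$2"
}
```

Note that this drops the whole record, so any target that runs past the end of its contig disappears from the resulting interval list.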

Thanks Pierre. The sequence dictionary has been created using the FASTA file and Picard's CreateSequenceDictionary function. Is there a possibility of that going wrong?

Also, the bad lines can be removed, but will this affect the "target areas" for variant calling?


> The sequence dictionary has been created using the FASTA file and Picard's CreateSequenceDictionary function. Is there a possibility of that going wrong?

No, so the problem would come from your BED file (is it mm10?).

> Also, the bad lines can be removed, but will this affect the "target areas" for variant calling?

We have no idea how you're going to use this interval file. Which tool?

You can always trim the BED file.

(...) if($1=="chr1") printf("%s\t%d\t%d\n",$1,$2,($3 <= 195471971 ?$3:195471971));(....)
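Spelled out for every contig at once, the trimming could look roughly like the sketch below. It assumes the .dict is the SAM-header text file from CreateSequenceDictionary (tab-separated @SQ lines with SN: and LN: tags) and that the BED file is tab-separated; `clip_bed_to_dict` is a hypothetical name. Unlike the fragment above, it keeps any extra BED columns, and it drops records that start at or beyond the contig end, since nothing would be left of them after clipping.

```shell
# clip_bed_to_dict is a hypothetical helper name, not a Picard tool.
# Usage: clip_bed_to_dict genome.dict in.bed > trimmed.bed
clip_bed_to_dict() {
  awk '
    BEGIN { FS = OFS = "\t" }
    # First file: contig name -> length, taken from the @SQ lines.
    NR == FNR {
      if ($1 == "@SQ") {
        name = ""; len = 0
        for (i = 2; i <= NF; i++) {
          if ($i ~ /^SN:/) name = substr($i, 4)
          else if ($i ~ /^LN:/) len = substr($i, 4) + 0
        }
        seqlen[name] = len
      }
      next
    }
    # Second file: pass BED header lines through unchanged.
    /^(track|browser|#)/ { print; next }
    # Drop unknown contigs and records that start past the contig end.
    !($1 in seqlen) || $2 + 0 >= seqlen[$1] { next }
    # Clip the end coordinate to the contig length.
    $3 + 0 > seqlen[$1] { $3 = seqlen[$1] }
    { print }
  ' "$1" "$2"
}
```

Whether clipping is appropriate depends on why the coordinates overflow. If the BED file was built against a different assembly, the clipped targets would still point at the wrong positions.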

Yes, it's mm10. This interval file is being prepared for GATK variant calling.
