Potential bug in mafsInRegion?
2
0
Entering edit mode
7.4 years ago
Thomas ▴ 160

Hello,

I am writing concerning the following UCSC utility (binary file): http://hgdownload.cse.ucsc.edu/admin/exe/

================================================================
========   mafsInRegion   ====================================
================================================================
mafsInRegion - Extract MAFS in a genomic region
usage:
    mafsInRegion regions.bed out.maf|outDir in.maf(s)
options:
    -outDir - output separate files named by bed name field to outDir
    -keepInitialGaps - keep alignment columns at the beginning and of a block that are gapped in all species

I have been playing with this executable a little bit, and have came across some odd behaviour:

If I pass a BED file with one record to mafsInRegion - we have success, and the expected behaviour is exhibited i.e. a region is extracted from the input MAF file to an output file according to the given genomic co-ordinates - e.g.

chr1    1233    1244    Transcript_1    (mafsInRegion processing succeeds)

However, If I pass a different BED file to mafsInRegion, this time with the same BED record, but this time with an additional BED record including either the same start or end co-ordinate of the preceding BED record, then mafsInRegion will fail - and apart from header information, the output file is empty:

chr1    1233    1244    Transcript_1    (mafsInRegion processing succeeds)
chr1    1233    1245    Transcript_2    (mafsInRegion processing fails)
chr1    1244    1245    Transcript_3    (mafsInRegion processing fails)

So my question is this:

  1. Is this a recognised problem in mafsInRegion?
  2. Is there a recognised solution to this problem?
  3. If not, what do you reckon will be the easiest way to fix this problem?

Thank you

mafsInRegion UCSC UCSC_genome_browser • 2.8k views
ADD COMMENT
2
Entering edit mode
6.9 years ago

Hello,

I have come across the same odd behaviour using mafsInRegion and from what I have noticed, the coordinates do not seem to matter, it's a problem with the number of lines in the bed file.

Running the program with a single line bed file returns the expected result. Running it with more than 5 lines will fetch the wrong region in some cases and will return an empty file for others.

Here's an example of a line that works when in a file by itself :

chr2    179456038       179462464       CATP00000883767.1       2       +

I get :

##maf version=1 scoring=autoMZ.v1
a score=0.000000
s hg19.chr2                        179456038 66 + 243199373 ATGGTAAGTACTGATGAGAAGTTGTCTGTTTCAACTGTGTAGTGCTCATCGGTTTTAATCTCACTC
s panTro2.chr2b                    183626670 66 + 248603653 ATGGTAAGTACTGATGAGAAGTTGTCTGTTTCAACTGTGTAGTGCTCATCGGTTTTAATCTCACTC
q panTro2.chr2b                                             999999999999999999999999999999999999999999999999999999999999999999
i panTro2.chr2b                    C 0 C 0
s gorGor1.Supercontig_0273903           2142 66 -      2212 
.....

which is the correct output.

when run in a file that contains 13 other lines and looks like this :

chr2    179406992       179407028       CATP00000004513.1       1       +
chr2    179436036       179440708       CATP00002698154.1       1       +
chr2    179442318       179442540       CATP00002345870.1       1       +
chr2    179464336       179464426       CATP00001176733.1       2       +
chr2    179500339       179500447       CATP00002315182.1       1       +
chr2    179474319       179474529       CATP00002157734.1       1       +
chr2    179629093       179638047       CATP00002900465.1       1       +
chr2    179479248       179479527       CATP00000210653.1       2       +
chr2    179436036       179462326       CATP00002016444.1       2       +
chr2    179439945       179440820       CATP00002345871.1       1       +
chr2    179404308       179404431       CATP00000332098.1       1       +
chr2    179482495       179482699       CATP00001007021.1       2       +
chr2    179412405       179412684       CATP00002348243.1       2       +
chr2    179456038       179462464       CATP00000883767.1       2       +

the coordinates get messed up and I get this :

 ##maf version=1 scoring=autoMZ.v1
a score=0.000000
s hg19.chr2                        179462327 96 + 243199373 AGACGCCTTGATGGCTCCTCTGGCAGTTCTTGATGACCATGG----ATGAGCTAATGGCTGTGGTCTCAATGGTGGCTTCTTGAGGTAAGGTTCTTTCAT
s panTro2.chr2b                    183632906 96 + 248603653 AGACGCCTTGATGGCTCCTCTGGCAGTTCTTGATGACCATGG----ATGAGCTAATGGCTGTGGTCTCAATGGTGGCTTCTTGAGGTAAGGTTCTTTCAT
q panTro2.chr2b                                             999999999999999999999999999999999999999999----999999999999999999999999999999999999999999999999999999
  ......

Notice how the start coordinates have changed in hg19.chr2 in the second case

When running the same multi-line file, the output file CATP00002345870.1.maf (corresponding to the bed file's first line) contains only :

##maf version=1 scoring=autoMZ.v1

However if I run the relevant line :

chr2    179406992       179407028       CATP00000004513.1       1       +

by itself, I get the correct output :

    ##maf version=1 scoring=autoMZ.v1
a score=0.000000
s hg19.chr2                        179406992 36 + 243199373 ATGCTGCTGCGACACTCTATGACCTCAGACTGCAAG
s anoCar1.scaffold_385                468376 36 +   1396552 ATTCTGCTTTTGCATTCTACGATTTCAGACTGTAGG
q anoCar1.scaffold_385                                      999999999999999999999999999999999999
i anoCar1.scaffold_385             I 8 C 0
s taeGut1.chr7                      17984185 36 +  39844632 ATTCTGCTCTTGCATTCTATAATTTCAGACTGCAGG
q taeGut1.chr7                                              999999999999999999999999999999999999
.....

I should specify that I am searching in the hg19/GRCh37 46-way MAF alignments downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/

This feels like a bug to me, although I really hope it isn't. Any ideas?

Thanks

ADD COMMENT
0
Entering edit mode

Can you send this question as an email to genome@soe.ucsc.edu? That way our whole team can see the question and has the opportunity to answer.

Thanks ChrisL from the UCSC Genome Browser

ADD REPLY
0
Entering edit mode

Hi Chris,

The question has been sent to genome@soe.ucsc.edu as well.

Thank you, Nikos

ADD REPLY
0
Entering edit mode
7.3 years ago
genecats.ucsc ▴ 580

Hello, Thomas.

Thank you for questions about our utility "mafsInRegion".

Unfortunately, I was unable to replicate this issue. My input MAF file was hg38/multiz7way.maf and a BED file that contained the following two lines:

chr21   31659621        31668831        uc002ypa.4.1  
chr21   31659621        31668931        uc002ypa.4

As you can see both transcripts start at the same position, but have slightly different end positions. My output using this file was a file that contained MAF lines just as I would have expected:

##maf version=1 scoring=blastz
a score=0.000000
s hg38.chr21     31659621 547 +  46709983 GTTTGGGGCCAGAGTGGGCGAGGCGCGGAGGT-------CTGGCCTATAAAGTAGTCGCGGAGACGGGGTGCTG---GTTTGCGTCGTAGTCTCCTGCAGCG-------TCTGGGGTTTC------
...

Perhaps you can provide more information about what MAF files you are using with your example BED lines? I also attempted to replicate the issue using the BED lines that you provided and I didn't have any luck getting any meaningful output. I made a file with just a line for "transcript_1" and then I made another with both "transcript_1" and "transcript_2". In both cases, I used the hg38/multiz7way.maf MAF file I referenced before and I only got the output:

##maf version=1 scoring=blastz

Which would seem to indicate that there were no alignments in the MAF at this position? (I can also confirm this by looking at the Genome Browser for hg38 at this position with the 7-way multiz track active: https://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=mspeir&hgS_otherUserSessionName=hg38_biostars_226762.)

If you are still having issues, it would be helpful if you could post them to our Google Groups forum: https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome-mirror, that way our whole team can see the question and help with an answer.

Thanks,

Matthew from the UCSC Genome Browser

ADD COMMENT
0
Entering edit mode

Thank you Matthew - sorry, I have only just seen your message. I will look into this and get back to you.

Cheers, Thomas

ADD REPLY

Login before adding your answer.

Traffic: 1569 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6