Inconsistent query coordinates between chain and net alignments in UCSC
7.1 years ago
BlastedBadger ▴ 160

I am searching for a gene in the rock hyrax genome (proCap1) using the mouse ortholog, which is located in mm10 at chr17:71344493-71475343.

Using the UCSC MySQL database, I get different proCap1 coordinates depending on whether I query the "chain" or the "net" alignment table.

Once I am connected to the database,

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A mm10

I do these queries:

# Chain
SELECT tName,tStart,tEnd,qStrand,qName,qStart,qEnd,id,score FROM chainProCap1
WHERE (tName="chr17"
    AND (71344493 <= tStart AND tStart <= 71475343
        OR 71344493 <= tEnd AND tEnd <= 71475343))
ORDER BY tStart;

# Net
SELECT level, tName,tStart,tEnd,strand,qName,qStart,qEnd,score,chainId FROM netProCap1
WHERE (tName="chr17" AND type="top"
    AND (71344493 <= tStart AND tStart <= 71475343
        OR 71344493 <= tEnd AND tEnd <= 71475343))
ORDER BY tStart;
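
A side note on the region filter used in both queries: the WHERE clause keeps a row only when its tStart or its tEnd falls inside the region, so a feature that spans the whole region would be skipped. The sketch below compares that test with a general overlap test. The function names and the spanning feature are hypothetical, and coordinates are assumed to be 0-based, half-open as in UCSC tables.

```python
def endpoint_inside(feat_start, feat_end, region_start, region_end):
    """Mirror of the WHERE clause above: one endpoint must lie in the region."""
    return (region_start <= feat_start <= region_end
            or region_start <= feat_end <= region_end)


def overlaps(feat_start, feat_end, region_start, region_end):
    """General overlap test for 0-based, half-open intervals."""
    return feat_start < region_end and feat_end > region_start


REGION = (71344493, 71475343)  # mm10 chr17 region from the question

# First chain row from the results: its tEnd lies inside the region.
print(endpoint_inside(71344307, 71345537, *REGION), overlaps(71344307, 71345537, *REGION))

# Hypothetical feature spanning the whole region (not a real row).
print(endpoint_inside(71300000, 71500000, *REGION), overlaps(71300000, 71500000, *REGION))
```

In SQL the general test would read `tStart < 71475343 AND tEnd > 71344493`.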

And their results:

Chain:

+-------+----------+----------+---------+-----------------+--------+--------+---------+-------+
| tName | tStart   | tEnd     | qStrand | qName           | qStart | qEnd   | id      | score |
+-------+----------+----------+---------+-----------------+--------+--------+---------+-------+
| chr17 | 71344307 | 71345537 | -       | scaffold_115863 |   4129 |   5390 |   98261 | 33044 |
| chr17 | 71348765 | 71349076 | -       | scaffold_45687  |   7675 |   7995 | 1069519 | 10518 |
| chr17 | 71349501 | 71350225 | -       | scaffold_286389 |    301 |    912 |  600619 | 14552 |
| chr17 | 71353320 | 71353647 | -       | scaffold_45687  |  13998 |  14354 |  450458 | 16409 |
| chr17 | 71356976 | 71366664 | -       | scaffold_58060  |   6662 |  18451 |   59516 | 43785 |
| chr17 | 71369677 | 71369991 | +       | scaffold_136193 |   5810 |   6136 |  677010 | 13810 |
| chr17 | 71370047 | 71381518 | -       | scaffold_39409  |    653 |  16004 |   26685 | 77165 |
| chr17 | 71386776 | 71400977 | -       | scaffold_6004   |   3376 |  31456 |   59454 | 43813 |
| chr17 | 71403036 | 71403303 | -       | scaffold_6004   |  36494 |  36767 |  851074 | 12305 |
| chr17 | 71411610 | 71412070 | -       | scaffold_6004   |  45733 |  46196 |  410158 | 17057 |
| chr17 | 71415580 | 71415876 | +       | scaffold_173989 |     37 |    338 |  795373 | 12765 |
| chr17 | 71421335 | 71421684 | -       | scaffold_92096  |    962 |   1365 |  860242 | 12230 |
| chr17 | 71426009 | 71429765 | -       | scaffold_6004   |  65154 |  68927 |   32900 | 65989 |
| chr17 | 71431173 | 71441015 | -       | scaffold_87525  |    315 |   2878 |   68830 | 39933 |
| chr17 | 71443948 | 71444160 | -       | scaffold_87525  |   5699 |   5910 | 1381300 |  7952 |
| chr17 | 71447491 | 71448878 | +       | scaffold_47467  |    712 |   1665 |  147828 | 27038 |
| chr17 | 71455537 | 71465879 | -       | scaffold_23639  |   5711 |  12225 |   68263 | 40118 |
| chr17 | 71458177 | 71459898 | +       | scaffold_1036   | 107842 | 113323 |   25364 | 80102 |
+-------+----------+----------+---------+-----------------+--------+--------+---------+-------+
18 rows in set (1.09 sec)

Net:

+-------+-------+----------+----------+--------+-----------------+--------+--------+-------+---------+
| level | tName | tStart   | tEnd     | strand | qName           | qStart | qEnd   | score | chainId |
+-------+-------+----------+----------+--------+-----------------+--------+--------+-------+---------+
|     1 | chr17 | 71344307 | 71345537 | -      | scaffold_115863 |    270 |   1531 | 33044 |   98261 |
|     1 | chr17 | 71348765 | 71349076 | -      | scaffold_45687  |  10669 |  10989 | 10518 | 1069519 |
|     1 | chr17 | 71349501 | 71350225 | -      | scaffold_286389 |    137 |    748 | 14552 |  600619 |
|     1 | chr17 | 71353320 | 71353647 | -      | scaffold_45687  |   4310 |   4666 | 16409 |  450458 |
|     1 | chr17 | 71356976 | 71366664 | -      | scaffold_58060  |    159 |  11948 | 43785 |   59516 |
|     1 | chr17 | 71369677 | 71369991 | +      | scaffold_136193 |   5810 |   6136 | 13810 |  677010 |
|     1 | chr17 | 71370047 | 71381518 | -      | scaffold_39409  |   2627 |  17978 | 77165 |   26685 |
|     1 | chr17 | 71386776 | 71400977 | -      | scaffold_6004   |  37912 |  65992 | 43813 |   59454 |
|     1 | chr17 | 71403036 | 71403303 | -      | scaffold_6004   |  32601 |  32874 | 12305 |  851074 |
|     1 | chr17 | 71411610 | 71412070 | -      | scaffold_6004   |  23172 |  23635 | 17057 |  410158 |
|     1 | chr17 | 71415580 | 71415876 | +      | scaffold_173989 |     37 |    338 | 12765 |  795373 |
|     1 | chr17 | 71421335 | 71421684 | -      | scaffold_92096  |   6512 |   6915 | 12230 |  860242 |
|     1 | chr17 | 71426009 | 71429765 | -      | scaffold_6004   |    441 |   4214 | 65989 |   32900 |
|     1 | chr17 | 71431173 | 71441015 | -      | scaffold_87525  |   6996 |   9559 | 39933 |   68830 |
|     1 | chr17 | 71443948 | 71444160 | -      | scaffold_87525  |   3964 |   4175 |  7952 | 1381300 |
|     1 | chr17 | 71447491 | 71448878 | +      | scaffold_47467  |    712 |   1665 | 27038 |  147828 |
|     1 | chr17 | 71455537 | 71456659 | -      | scaffold_23639  |  19155 |  20551 | 23992 |   68263 |
|     1 | chr17 | 71458177 | 71459898 | +      | scaffold_1036   | 107842 | 113323 | 80102 |   25364 |
|     1 | chr17 | 71459898 | 71465879 | -      | scaffold_23639  |  14037 |  18794 | 16060 |   68263 |
+-------+-------+----------+----------+--------+-----------------+--------+--------+-------+---------+
19 rows in set (0.34 sec)

As you can see, the blocks in the mm10 genome are the same (tName, tStart, tEnd), apart from chain 68263, which the net splits into two rows. The lengths in proCap1 and the scores also match, so it is probably the same region that is aligned. But why are qStart and qEnd different? Which ones should I use to extract the sequences from the hyrax genome in .2bit format?


In case it is useful, I also ran the equivalent queries using proCap1 as the reference:

use proCap1;
# Net
SELECT level, tName,tStart,tEnd,strand,qName,qStart,qEnd,score,chainId FROM netMm10
WHERE (qName="chr17" AND type="top" AND
    (71344493 <= qStart AND qStart <= 71475343
        OR 71344493 <= qEnd AND qEnd <= 71475343))
ORDER BY qStart;
# -> 16 rows

# Chain
SELECT tName,tStart,tEnd,qStrand,qName,qStart,qEnd,score,Id FROM chainMm10
WHERE (qName="chr17" 
    AND (71344493 <= qStart AND qStart <= 71475343
        OR 71344493 <= qEnd AND qEnd <= 71475343))
ORDER BY qStart;
# -> 4248 rows

The query on the net alignment gives (almost) the same output (16 rows), but the chain query returns 4248 rows: many hyrax sequences map to the same mouse region. I still don't see why the query coordinates differ when using mm10 as the reference.

PS: if this can help, here is the aligned region in the genome browser.

7.0 years ago
BlastedBadger ▴ 160

Aaah, alright, I got the answer by asking on the UCSC Genome Browser mailing list: the documentation does specify that, when the strand is "-", query coordinates in chain files are relative to the reverse-complemented sequence. The same goes for axt.
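
To make the rule concrete, here is a minimal sketch of the conversion from chain-style query coordinates to forward-strand coordinates, which is what the net rows above appear to report. The function name is hypothetical. The scaffold size is not shown in the tables; 18664 is inferred as chain qEnd + net qStart for scaffold_45687 (7995 + 10669, and likewise 14354 + 4310), so treat it as an assumption. As far as I know the chain tables also have a qSize column that gives this value directly.

```python
def chain_query_to_forward(q_start, q_end, q_size, q_strand):
    """Convert chain query coordinates to forward-strand coordinates.

    Coordinates are 0-based, half-open. On the '-' strand, chain
    coordinates count from the start of the reverse-complemented
    query sequence, so they are flipped around the sequence size.
    """
    if q_strand == "-":
        return q_size - q_end, q_size - q_start
    return q_start, q_end


# Assumed size of scaffold_45687, inferred from the two tables (see above).
SCAFFOLD_45687_SIZE = 18664

# Chain rows for scaffold_45687 (both on the '-' strand).
print(chain_query_to_forward(7675, 7995, SCAFFOLD_45687_SIZE, "-"))    # (10669, 10989)
print(chain_query_to_forward(13998, 14354, SCAFFOLD_45687_SIZE, "-"))  # (4310, 4666)

# '+' strand rows are left unchanged, as for scaffold_136193.
print(chain_query_to_forward(5810, 6136, 6136, "+"))  # (5810, 6136)
```

Both converted intervals match the qStart and qEnd of the corresponding net rows. The forward-strand interval is presumably the one to use when extracting from the .2bit file, and the sequence of a "-" hit would then need to be reverse-complemented to match the mouse orientation.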
