
Implementing an efficient Geo IP location system in MySQL

Applications often need to know where a user is physically located. The easiest way to figure that out is by looking up their IP address in a special database. It can all be implemented in MySQL, but I often see it done inefficiently. In this post I will show how to implement a complete solution that offers great performance.

Importing Geo IP data

First you will need a database mapping network addresses to real locations. There are various resources available, but I chose the one the nginx web server uses with its geoip module. GeoLite City comes in CSV format and is available for download free of charge from MaxMind.

The archive contains two files. GeoLiteCity-Blocks.csv lists all IP ranges and maps each one to the corresponding location identifier, while GeoLiteCity-Location.csv contains location details such as country, region or city names. The contents of these files can be imported into two MySQL tables. I created the following structures:

CREATE TABLE `geoip_blocks` (
  `gbl_block_start` int(10) unsigned NOT NULL,
  `gbl_block_end` int(10) unsigned NOT NULL,
  `gbl_glc_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`gbl_block_start`,`gbl_block_end`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE `geoip_locations` (
  `glc_id` int(10) unsigned NOT NULL,
  `glc_country` char(2) NOT NULL,
  `glc_region` varchar(2) NOT NULL,
  `glc_city` varchar(64) NOT NULL,
  `glc_zip` varchar(16) NOT NULL,
  `glc_latitude` decimal(7,4) NOT NULL,
  `glc_longitude` decimal(7,4) NOT NULL,
  PRIMARY KEY (`glc_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Please note that I used the INT UNSIGNED type for gbl_block_start and gbl_block_end, the columns that will carry network addresses. An IPv4 address can be represented either as text or as a 32-bit number. The database I use relies on the latter, which is a good thing, because it makes the data much smaller and also easier to query.
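
For illustration, here is how the sample address used later in this post maps back and forth between the two representations; the numeric value fits comfortably within INT UNSIGNED, whose maximum is 4294967295:

mysql> SELECT INET_ATON('149.156.1.4') AS addr_num, INET_NTOA(2510029060) AS addr_text;
+------------+-------------+
| addr_num   | addr_text   |
+------------+-------------+
| 2510029060 | 149.156.1.4 |
+------------+-------------+
1 row in set (0.00 sec)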

The next step is preparing the CSV files for import. As it turns out, GeoLiteCity-Location.csv is not UTF-8 encoded, so a conversion is required, otherwise the data will be broken once loaded into MySQL.

% file -ib GeoLiteCity-Location.csv                                                    
text/plain; charset=iso-8859-1

% iconv -f ISO-8859-1 -t UTF-8 GeoLiteCity-Location.csv > GeoLiteCity-Location-UTF8.csv

% file -ib GeoLiteCity-Location-UTF8.csv                                                    
text/plain; charset=utf-8

Then the files can be imported into the tables.

mysql> LOAD DATA INFILE '/home/maciek/GeoLiteCity_20120504/GeoLiteCity-Blocks.csv' 
       INTO TABLE geoip_blocks CHARACTER SET utf8 FIELDS TERMINATED BY ',' 
       OPTIONALLY ENCLOSED BY '"' IGNORE 2 LINES (gbl_block_start, gbl_block_end, gbl_glc_id);
Query OK, 1875384 rows affected (1 min 17.67 sec)
Records: 1875384  Deleted: 0  Skipped: 0  Warnings: 0

mysql> LOAD DATA INFILE '/home/maciek/GeoLiteCity_20120504/GeoLiteCity-Location-UTF8.csv' 
       INTO TABLE geoip_locations CHARACTER SET utf8 FIELDS TERMINATED BY ',' 
       OPTIONALLY ENCLOSED BY '"' IGNORE 2 LINES (glc_id, glc_country, glc_region, glc_city, 
       glc_zip, glc_latitude, glc_longitude, @ignored, @ignored);
Query OK, 353223 rows affected (3.01 sec)
Records: 353223  Deleted: 0  Skipped: 0  Warnings: 0

Be sure to verify that neither statement reported any warnings. Otherwise the data was likely imported incorrectly.
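
As an additional sanity check, you can verify that every imported block references an existing location. A quick query like the one below (not part of the original import procedure, just a suggestion) should return zero:

mysql> SELECT COUNT(*)
       FROM   geoip_blocks gbl
              LEFT JOIN geoip_locations glc
              ON     glc.glc_id = gbl.gbl_glc_id
       WHERE  glc.glc_id IS NULL;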

Designing the lookup query

The next step is figuring out what query could efficiently map a particular IP address to the corresponding location. I already mentioned that our database uses numbers for network address representation. Any application that uses the text format internally will have to perform a conversion first. MySQL has a built-in function called INET_ATON() that maps the a.b.c.d form into a number, but every programming language has something similar as well, so the conversion can be done anywhere.
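
The conversion itself is trivial: for an address a.b.c.d the number is simply a*256^3 + b*256^2 + c*256 + d, so any language can compute it without help from MySQL. Purely as an illustration of the equivalence, both expressions below evaluate to the same number, 2510029060:

mysql> SELECT (149 << 24) | (156 << 16) | (1 << 8) | 4 AS manual,
              INET_ATON('149.156.1.4') AS built_in;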

Bad query

I have seen many approaches to the problem. The most common one, which I find implemented in many places, is a query like this:

mysql> SELECT glc.*
       FROM   geoip_blocks gbl
              JOIN geoip_locations glc
              ON     glc.glc_id = gbl.gbl_glc_id
       WHERE  INET_ATON('149.156.1.4') BETWEEN gbl.gbl_block_start AND gbl.gbl_block_end\G
*************************** 1. row ***************************
         glc_id: 57106
    glc_country: PL
     glc_region: 77
       glc_city: Cracow
        glc_zip: 
   glc_latitude: 50.0833
  glc_longitude: 19.9167
1 row in set (0.43 sec)

It seems natural to simply look for the range that contains the given IP address, but it is not very efficient:

mysql> SHOW STATUS LIKE 'Handler_read_%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_read_first         | 1       |
| Handler_read_key           | 2       |
| Handler_read_last          | 0       |
| Handler_read_next          | 1378462 |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_next      | 0       |
+----------------------------+---------+
7 rows in set (0.00 sec)
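
The Handler_read_% counters accumulate over the whole session, so if you want to repeat this measurement yourself, it helps to reset them right before running the query under test (an extra step, not shown in the output above):

mysql> FLUSH STATUS;   -- reset the session status counters
mysql> SELECT ... ;    -- run the query being measured
mysql> SHOW STATUS LIKE 'Handler_read_%';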

mysql> EXPLAIN SELECT glc.*
       FROM   geoip_blocks gbl
              JOIN geoip_locations glc
              ON     glc.glc_id = gbl.gbl_glc_id
       WHERE  INET_ATON('149.156.1.4') BETWEEN gbl.gbl_block_start AND gbl.gbl_block_end\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: gbl
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 937963
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: glc
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: geoip.gbl.gbl_glc_id
         rows: 1
        Extra: 
2 rows in set (0.00 sec)

The query uses the primary key for filtering in the geoip_blocks table, but the Handler_read_next value in the SHOW STATUS output indicates that it reads a lot of rows along the way. MySQL cannot fully optimise a condition that involves multiple columns, even if they are all part of the same index. What it does here is use the primary key only to match the rows where gbl_block_start is not larger than the address value; the condition on gbl_block_end still has to be checked row by row. That meant reading roughly 1.4M rows and rejecting all but one. The actual number varies depending on the IP address value, but overall it is far from an optimal solution.
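
One way to see where the roughly 1.4M figure comes from is to count how many blocks start at or below the given address - that is exactly the set of index entries the range scan has to walk through (the precise count depends on the version of the dataset you imported):

mysql> SELECT COUNT(*)
       FROM   geoip_blocks
       WHERE  gbl_block_start <= INET_ATON('149.156.1.4');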

Good query

What we know about these IP address ranges is that they should be mutually exclusive, i.e. if the values from 2510028800 to 2510045951 make one block, none of them can appear in any other block. So to discover which range an address falls into, all we need to find is either the nearest preceding gbl_block_start value or the nearest following gbl_block_end value. This can be done very efficiently:

mysql> SELECT   glc.*
       FROM     geoip_blocks gbl
                JOIN geoip_locations glc
                ON       glc.glc_id = gbl.gbl_glc_id
       WHERE    gbl_block_start <= INET_ATON('149.156.1.4')
       ORDER BY gbl_block_start DESC
       LIMIT    1\G
*************************** 1. row ***************************
         glc_id: 57106
    glc_country: PL
     glc_region: 77
       glc_city: Cracow
        glc_zip: 
   glc_latitude: 50.0833
  glc_longitude: 19.9167
1 row in set (0.00 sec)

The execution time dropped from 0.43 sec to 0.00 sec, yet if we compare the execution plans, they do not seem to differ between the two queries:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: gbl
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 937963
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: glc
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: geoip.gbl.gbl_glc_id
         rows: 1
        Extra: 

They are indeed the same. But the way MySQL accessed the rows in this new query was completely different:

mysql> SHOW STATUS LIKE 'Handler_read_%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_read_first         | 0     |
| Handler_read_key           | 2     |
| Handler_read_last          | 0     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 0     |
+----------------------------+-------+

In this case MySQL simply starts reading rows backwards, in descending order, from the largest gbl_block_start value that satisfies gbl_block_start <= INET_ATON('149.156.1.4'). Applying LIMIT 1 allows it to stop execution immediately after finding the first matching row. The goal of our query was to find the single row with the largest gbl_block_start value no larger than INET_ATON('a.b.c.d'), and this is precisely where the query began its execution and where it ended - perfect efficiency, zero overhead.
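
If the lookup is needed in many places, the query can be wrapped for convenience, for example in a stored function. The sketch below is only an illustration of that idea - the function name and signature are mine, not part of the GeoLite schema - and it simply reuses the query from above:

mysql> DELIMITER //
mysql> CREATE FUNCTION geoip_country_code(p_ip VARCHAR(15))
       RETURNS CHAR(2)
       READS SQL DATA
       BEGIN
         DECLARE v_country CHAR(2);
         -- same lookup as above: nearest preceding block start, first match wins
         SELECT   glc.glc_country INTO v_country
         FROM     geoip_blocks gbl
                  JOIN geoip_locations glc
                  ON       glc.glc_id = gbl.gbl_glc_id
         WHERE    gbl.gbl_block_start <= INET_ATON(p_ip)
         ORDER BY gbl.gbl_block_start DESC
         LIMIT    1;
         RETURN v_country;
       END//
mysql> DELIMITER ;

It could then be called as SELECT geoip_country_code('149.156.1.4'), keeping the efficient access pattern hidden behind a single call.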
