An introduction to SQL query optimisation

If you run or manage a website, chances are there's a database behind it. Web content and applications are increasingly dynamic nowadays, meaning there's a strong reliance on a well-performing database if you're to handle growth in your user base. You can go a fairly long way with a sub-par database, but you'll eventually hit a point where load times become unacceptable and responsiveness starts to decline for all your users. This can be due to a large number of users, or perhaps a growing dataset that gets too large to process efficiently.

We've historically found that customers on one of our dedicated servers ask about hardware upgrades at this point, and it's not hard to see why. Upgrades are modestly priced, generally quite quick to get done, and make things a bit easier on the app. The problem is that this doesn't scale. At absolute best you'll get a linear gain from the hardware upgrade, and the costs increase far quicker than the specs do. This is why it's important to look at your application when dealing with growth.

We've seen even simple optimisations bring improvements of over an order of magnitude, and the cost is a modest once-off investment. It just doesn't make sense not to consider optimisation before forking out for more hardware.

Who is this article for, and what are we doing?

This article will be of most use to a developer who's building or maintaining a database-driven site. If you oversee a site but don't have to manage the data or code, that's okay too; knowing what to look out for is also very useful and can help you plan for the future.

As you might already know, we prefer to use PostgreSQL over MySQL. That said, we see a lot more people using MySQL, so we'll try to cover them both equally.

If you're not sure if this applies to you, read on a little further; one of the first things we'll cover is determining whether query optimisation can help you.

We're going to talk about using table indexes to make the query engine and optimiser much more efficient. If you're already "down" with indexes, this probably isn't for you.

Why choose to optimise?

As mentioned above, optimisation can bring vast improvements in performance to your application. While it's not a regular service that we offer, we've done optimisation for customers in the past on a case-by-case basis. In the most recent instance, we spent about 90 minutes performing analysis and making a few changes and recommendations. This reduced some page load times from ~35sec to under a second.

Let's contrast this with the hardware upgrade route. It's almost difficult to buy a CPU running at less than 2GHz nowadays. A cheapie Celeron processor might set you back some $80-100. The E1400 model fits the bill, a dual-core chip running at 2GHz with 512KB of L2 cache. As an upgrade, you could pick the Core 2 Duo E8600 for about $450. It's also a dual-core chip, running at 3.33GHz, and has 6MB of L2 cache. It's objectively a much better CPU, thanks to the extra L2 cache and solid bump in speed. Whether it's at least four times better for you is a different matter, and it's a hard question to answer without testing both. Even being optimistic and assuming an eight-fold improvement, the optimisation was still four times better (and cheaper as well, unless your developer charges a couple hundred dollars an hour).

That might be a good upgrade after all, but what do you do when you need to grow even more? You can probably shift to a CPU with more cores. If you move to expensive server-grade hardware you can have multiple CPUs (and lots more RAM), but most workloads simply can't take full advantage of all that parallelism. Keeping Amdahl's law in mind, the returns quickly start diminishing as you go further. Even ignoring this, the costs are prohibitive for all but the biggest services. If you're serious and need more than two CPUs then you start to pay thousands of dollars per CPU. Optimising your code is suddenly compelling. :)

Choice of storage engine

If you are using MySQL, one of the best and easiest things you can do is switch to using the InnoDB storage engine. My good colleague Saj has already written an article on the MySQL storage engines, but the long and the short of it is that InnoDB offers you better data integrity and performance for concurrent workloads. It's really easy to make the switch, too.
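
Converting an existing table is a one-liner; just repeat it for each table you want to convert:

mysql> ALTER TABLE tablename ENGINE=InnoDB;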

What is your application doing?

Knowing how your application works and what it's asking of the database is key to making it work well. Although we can perform optimisation work, a substantial amount of the billed time is spent finding this out. This is really best done by the developer or maintainer of the site, as they'll have insight into how everything works and where data is coming from.

There are a couple of ways to find out what the application is doing. One is to have intimate knowledge of the code, but this is very hard in practice. The easier way is to enable query logging in the database; with this data we can get an idea of what's going on. After making these changes you'll need to reload or restart the database daemon.

MySQL

Logging queries in MySQL is easy; just remember that it can severely impact performance, so it's best not to do it on a busy site. These options are usually set in /etc/my.cnf on Red Hat-type systems, and /etc/mysql/my.cnf on Debian-types.

# this will log all queries
log = /var/log/mysql/mysql.log

# if you're only interested in complex queries that take a long time, you want slow-query-logging; long_query_time is measured in seconds
# the last option is particularly interesting later on when we discuss indexing
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 2
log-queries-not-using-indexes

PostgreSQL

Your postgresql.conf contains the directives needed to enable logging. On Debian-type systems it's in /etc/postgresql/version.number/main, while on Red Hat-types it's in /var/lib/pgsql/data. You'll want to modify these lines as appropriate for your setup.

log_min_duration_statement = 0  # log queries that run longer than this many milliseconds (0 logs everything), similar to mysql's slow query log
log_statement = 'all'           # log executed statements; valid values are 'none', 'ddl', 'mod' and 'all'
log_destination = 'syslog'      # these three control where the logs end up, you'll need to make sure they're appropriate for your system
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'

Analysing the queries

Once you have a log of the queries being run you can start to work out what the database is doing. I'll assume you're able to open a command-line session to the database, though you should also be able to use tools such as phpMyAdmin or phpPgAdmin. In both cases, you can use EXPLAIN to get the details you need; you just run it on the queries that you're interested in, or suspect may be slow. The best candidates for this are queries that take a long time and/or are run frequently. A modest saving on a common query can add up to a significant benefit.

Finding which queries are most in need of optimisation can be a bit tricky. If you're running PostgreSQL you can use pgFouine to help get an idea of what queries best satisfy the "take a long time and/or are run frequently" requirement.

MySQL

We'll use a somewhat interesting query to illustrate: it matches information about suburbs with the streets in those suburbs.

mysql> EXPLAIN EXTENDED SELECT scid,street,streettyp,sub,code FROM subcode INNER JOIN street USING (scid) WHERE (street.street = 'GALSTON') AND (street.streettyp = 'RD') AND (subcode.sub = 'HORNSBY HEIGHTS') AND (subcode.code = '2077');
+----+-------------+---------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+----+-------------+---------+------+---------------+------+---------+------+-------+-------------+
|  1 | SIMPLE      | subcode | ALL  | NULL          | NULL | NULL    | NULL | 12550 | Using where | 
|  1 | SIMPLE      | street  | ALL  | NULL          | NULL | NULL    | NULL | 11705 | Using where | 
+----+-------------+---------+------+---------------+------+---------+------+-------+-------------+
2 rows in set, 1 warning (0.01 sec)

PostgreSQL

As you can see, postgres is a bit more verbose about what it's doing.

pgsql=# EXPLAIN SELECT scid,street,streettyp,sub,code FROM subcode INNER JOIN street USING (scid) WHERE (street.street = 'GALSTON') AND (street.streettyp = 'RD') AND (subcode.sub = 'HORNSBY HEIGHTS') AND (subcode.code = '2077');
                                               QUERY PLAN                                                
---------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..702.84 rows=1 width=29)
   Join Filter: (subcode.scid = street.scid)
   ->  Seq Scan on subcode  (cost=0.00..360.25 rows=1 width=19)
         Filter: (((sub)::text = 'HORNSBY HEIGHTS'::text) AND (code = '2077'::bpchar))
   ->  Seq Scan on street  (cost=0.00..342.57 rows=1 width=14)
         Filter: (((street.street)::text = 'GALSTON'::text) AND ((street.streettyp)::text = 'RD'::text))
(6 rows)

What's going on here?

What you're seeing is part of the workings of the query engine. It's the engine's job to parse the SQL and turn it into a sequence of operations to retrieve the requested data. You'll notice there are a couple of sections to each output. What happens is that the engine pulls up data from the two tables in the query, then merges the results together to produce the final result set.

[Diagram (sql-merge.png): rows are selected from each table, then the results are merged to produce the final result set]

We'll discuss the specifics of the selection and merge process, but before that, we need to learn about indexes.

Indexes

Most programmers who deal with databases will have heard of indexes; they make your database go faster. While database design and theory is well out of the scope of this article, it suffices to know that indexes represent a structured form of a subset of your data. When searching for a row in a table, you're generally looking to match a specific key. A key can be whatever you want, though it's usually a column that's unique for every row in the table.

It's generally faster to consult the index, assuming the index covers the column you're searching on. If the column you're using as a search key doesn't have an index, the engine has to check every single row to find what it's looking for. In a big table this is very slow, and it takes progressively longer as the table grows. With an index, the lookup time is close enough to constant.
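
As a quick sketch of the difference (the street table appears in the examples later on; the index name here is hypothetical):

-- without an index on street.street, the engine must check every row in the table
SELECT scid FROM street WHERE street = 'GALSTON';

-- with an index on the search key, it can jump straight to the matching rows
CREATE INDEX street_street_idx ON street (street);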

One thing to keep in mind is that indexes take up space and will slow down UPDATE and INSERT queries somewhat, as the engine needs to update all the indexes appropriately. In reality this is probably not a significant issue, and indexes tend to be very good "value" for the extra time they take. You just don't want to keep extra indexes that aren't being used.

How indexes are used

When the engine needs to retrieve rows from a table based on a key, it will generally refer to the index. Assuming the index matches your search key, it can find all the matching entries in the index, then follow these pointers to the full rows in the table. Because the index is much smaller and well-organised, this is very fast. The index can be used any time the engine needs to find rows with a matching search key.

How lookups work (SELECT)

In a table without indexes the engine simply starts at the top and reads every row, pulling out the ones that match your query. This is called Sequential Scanning. The engine may also do this for small tables, where it's faster to just read the table, rather than reading the index and then the table. Postgres will also perform a Seqscan if it thinks it's going to return more than a certain proportion of the table; in this case it's actually more efficient.

When an index is present, it can be used to find and return matching rows very quickly, as described above. In addition, the index can also be used in an ordered manner, eg. to find all rows where a particular column is greater-than or less-than a given value.
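
For instance, a btree index on the code column of the subcode table (which you'll meet shortly) could serve an ordered lookup like this sketch:

-- all postcodes in a range; the index finds these without scanning the whole table
SELECT sub, code FROM subcode WHERE code >= '2000' AND code <= '2100';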

How joins work ( [INNER|LEFT|RIGHT|OUTER] JOIN )

As the diagram above showed, a JOIN operation merges data from two or more tables. To do this, the data must be sorted on the columns that are being joined. The sorted columns are then merged together, selecting matching rows and discarding unneeded ones. An index on the joined columns means the sorting is trivial; sorting things is generally computationally expensive (compared to the merging), so the index is a big win.

Some of this will depend on your exact query, but the overall pattern holds true; data is retrieved from tables, sorted, merged, possibly sorted again, then further cut down if requested.

How to use this information

In case you haven't guessed, the point here is that you should add indexes to your tables. If you're going to select rows using a key column, that column should have an index. If you're going to be joining tables, the joined columns should have indexes. We'll illustrate this procedure using the query you saw earlier.

EXPLAIN
SELECT scid, street, streettyp, sub, code
FROM subcode
INNER JOIN street USING (scid)
WHERE (street.street = 'GALSTON')
  AND (street.streettyp = 'RD')
  AND (subcode.sub = 'HORNSBY HEIGHTS')
  AND (subcode.code = '2077');

Breaking this down, we can see there are two tables involved, subcode and street. The join qualifier clause, USING (scid), is a sort of shorthand; it means the join is performed for rows where subcode.scid = street.scid. This means it'd probably be beneficial to have an index on the scid column in both tables.
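
Written out in full, the USING clause is equivalent to the standard ON form:

SELECT scid,street,streettyp,sub,code
FROM subcode INNER JOIN street ON (subcode.scid = street.scid)
WHERE ...;  -- the WHERE clause is unchanged

Let's put an index on the scid columns and see what happens.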

Indexing the join

MySQL

Interestingly, there are no duplicates in this index. When you can, adding the UNIQUE constraint is a good thing, as it gives the engine more information to work with; there's a sketch of this after the output below.

mysql> CREATE INDEX subcode_scid_idx ON subcode (scid);
Query OK, 12550 rows affected (0.06 sec)
Records: 12550  Duplicates: 0  Warnings: 0
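
Since scid turned out to contain no duplicates, the index could have been declared UNIQUE instead; a minimal sketch:

mysql> CREATE UNIQUE INDEX subcode_scid_idx ON subcode (scid);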

mysql> EXPLAIN SELECT scid,street,streettyp,sub,code FROM subcode INNER JOIN street USING (scid) WHERE (street.street = 'GALSTON') AND (street.streettyp = 'RD') AND (subcode.sub = 'HORNSBY HEIGHTS') AND (subcode.code = '2077');
+----+-------------+---------+------+------------------+------------------+---------+-------------+-------+-------------+
| id | select_type | table   | type | possible_keys    | key              | key_len | ref         | rows  | Extra       |
+----+-------------+---------+------+------------------+------------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | street  | ALL  | NULL             | NULL             | NULL    | NULL        | 11705 | Using where | 
|  1 | SIMPLE      | subcode | ref  | subcode_scid_idx | subcode_scid_idx | 4       | street.scid |     1 | Using where | 
+----+-------------+---------+------+------------------+------------------+---------+-------------+-------+-------------+
2 rows in set (0.00 sec)

It only had to retrieve one row from the subcode table; that's a good thing! The output identifies the indexes that were available, and what it ended up using. The "Extra" column tells you more about how the rows are selected.

Postgres

Here you can see that it now does an Index Scan for the subcode table, using the index we defined. If you're sharp, you'll also see that the cost has been roughly halved. The cost is measured in the number of "pages" the engine had to fetch from the datastore, and comprises two numbers. The first is how long you have to "wait" until it can start delivering the first bits of data, the second is the total for all the data. The numbers themselves don't mean much, and you'll generally only see a couple of scenarios, at least in the early parts of the query:

  1. The first number is practically zero and the second is something higher - this means it's probably reading most of a table and doesn't need to do any sorting
  2. The first and second number are non-zero and fairly close to each other - the engine probably had to read a whole table then sort the results before delivering them
  3. The first number is non-zero and the second number is somewhat higher - you might be doing a join; the results of part of the query are returned, then the engine has to wait on another table for its results.

pgsql=# CREATE INDEX subcode_scid_idx ON subcode (scid);
CREATE INDEX
pgsql=# EXPLAIN SELECT scid,street,streettyp,sub,code FROM subcode INNER JOIN street USING (scid) WHERE (street.street = 'GALSTON') AND (street.streettyp = 'RD') AND (subcode.sub = 'HORNSBY HEIGHTS') AND (subcode.code = '2077');
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..350.86 rows=1 width=29)
   ->  Seq Scan on street  (cost=0.00..342.57 rows=1 width=14)
         Filter: (((street)::text = 'GALSTON'::text) AND ((streettyp)::text = 'RD'::text))
   ->  Index Scan using subcode_scid_idx on subcode  (cost=0.00..8.27 rows=1 width=19)
         Index Cond: (subcode.scid = street.scid)
         Filter: (((subcode.sub)::text = 'HORNSBY HEIGHTS'::text) AND (subcode.code = '2077'::bpchar))
(6 rows)

Index both tables

We'll go ahead and add another index to the street table, then check the results again.
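
For reference, the new index is created just like the last one; street_scid_idx is the name you'll see in the output below.

mysql> CREATE INDEX street_scid_idx ON street (scid);

pgsql=# CREATE INDEX street_scid_idx ON street (scid);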

MySQL doesn't seem to care about the index we've added. Perhaps it deems that the table is small enough not to bother.

mysql> EXPLAIN SELECT scid,street,streettyp,sub,code FROM subcode INNER JOIN street USING (scid) WHERE (street.street = 'GALSTON') AND (street.streettyp = 'RD') AND (subcode.sub = 'HORNSBY HEIGHTS') AND (subcode.code = '2077') ;
+----+-------------+---------+------+------------------+------------------+---------+-------------+-------+-------------+
| id | select_type | table   | type | possible_keys    | key              | key_len | ref         | rows  | Extra       |
+----+-------------+---------+------+------------------+------------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | street  | ALL  | street_scid_idx  | NULL             | NULL    | NULL        | 11705 | Using where | 
|  1 | SIMPLE      | subcode | ref  | subcode_scid_idx | subcode_scid_idx | 4       | street.scid |     1 | Using where | 
+----+-------------+---------+------+------------------+------------------+---------+-------------+-------+-------------+
2 rows in set (0.00 sec)

Postgres is also underwhelmed by the index (this particular example also tacks an ORDER BY onto the query, which is where the Sort node below comes from). It's entirely possible that the table is small enough that it doesn't bother with the index. Even so, I'd leave the index in place: it won't slow things down at this size, and it's good for when things grow.

pgsql=# EXPLAIN SELECT scid,street,streettyp,sub,code FROM subcode LEFT JOIN street USING (scid) WHERE (street.street = 'GALSTON') AND (street.streettyp = 'RD') AND (subcode.sub = 'HORNSBY HEIGHTS') AND (subcode.code = '2077') ORDER BY hregion;
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Sort  (cost=350.87..350.88 rows=1 width=180)
   Sort Key: subcode.hregion
   ->  Nested Loop  (cost=0.00..350.86 rows=1 width=180)
         ->  Seq Scan on street  (cost=0.00..342.57 rows=1 width=78)
               Filter: (((street)::text = 'GALSTON'::text) AND ((streettyp)::text = 'RD'::text))
         ->  Index Scan using subcode_scid_idx on subcode  (cost=0.00..8.27 rows=1 width=106)
               Index Cond: (subcode.scid = street.scid)
               Filter: (((subcode.sub)::text = 'HORNSBY HEIGHTS'::text) AND (subcode.code = '2077'::bpchar))
(8 rows)

What next?

What we've seen here is that the indexes make the query engine more efficient, but I suspect we can do better. You can see that postgres seqscans the street table, filtering out the rows it doesn't need. It then uses the subcode_scid_idx index to join the filtered rows against the subcode table, filtering more rows as it goes.

This filtering is also a bit inefficient, as we're possibly retrieving more data than we need. To improve this, we can add an index on the columns being filtered.

Postgres

This makes a substantial difference, and the "shape" of the query has changed. Instead of seqscanning street and then probing subcode, the new index is used to directly retrieve keyed rows from subcode, which are then joined to street. The lookup on street has also changed to use an index.

You'll also notice that the costs have dropped. The minimum has increased above zero, as the subquery on street has to finish before we can get data back, but the top end is now well under half what it was. A little more indexing should bring further improvements.

pgsql=# CREATE INDEX subcode_code_idx ON subcode USING btree (code);
CREATE INDEX
pgsql=# EXPLAIN SELECT scid,street,streettyp,sub,code FROM subcode INNER JOIN street USING (scid) WHERE (street.street = 'GALSTON') AND (street.streettyp = 'RD') AND (subcode.sub = 'HORNSBY HEIGHTS') AND (subcode.code = '2077') ;
                                               QUERY PLAN                                                
---------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=4.60..129.73 rows=1 width=29)
   ->  Index Scan using subcode_code_idx on subcode  (cost=0.00..19.60 rows=1 width=19)
         Index Cond: (code = '2077'::bpchar)
         Filter: ((sub)::text = 'HORNSBY HEIGHTS'::text)
   ->  Bitmap Heap Scan on street  (cost=4.60..110.12 rows=1 width=14)
         Recheck Cond: (street.scid = subcode.scid)
         Filter: (((street.street)::text = 'GALSTON'::text) AND ((street.streettyp)::text = 'RD'::text))
         ->  Bitmap Index Scan on street_scid_idx  (cost=0.00..4.60 rows=47 width=0)
               Index Cond: (street.scid = subcode.scid)
(9 rows)

As a final test, we'll add a few more indexes and see what happens.

pgsql=# CREATE INDEX subcode_sub_idx ON subcode USING btree (sub);
CREATE INDEX
pgsql=# CREATE INDEX street_streettyp_idx ON street USING btree (streettyp);
CREATE INDEX
pgsql=# EXPLAIN SELECT scid,street,streettyp,sub,code FROM subcode INNER JOIN street USING (scid) WHERE (street.street = 'GALSTON') AND (street.streettyp = 'RD') AND (subcode.sub = 'HORNSBY HEIGHTS') AND (subcode.code = '2077') ;
                                            QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
 Nested Loop  (cost=23.80..46.29 rows=1 width=29)
   ->  Index Scan using subcode_sub_idx on subcode  (cost=0.00..8.27 rows=1 width=19)
         Index Cond: ((sub)::text = 'HORNSBY HEIGHTS'::text)
         Filter: (code = '2077'::bpchar)
   ->  Bitmap Heap Scan on street  (cost=23.80..38.01 rows=1 width=14)
         Recheck Cond: ((street.scid = subcode.scid) AND ((street.streettyp)::text = 'RD'::text))
         Filter: ((street.street)::text = 'GALSTON'::text)
         ->  BitmapAnd  (cost=23.80..23.80 rows=4 width=0)
               ->  Bitmap Index Scan on street_scid_idx  (cost=0.00..4.60 rows=47 width=0)
                     Index Cond: (street.scid = subcode.scid)
               ->  Bitmap Index Scan on street_streettyp_idx  (cost=0.00..18.95 rows=893 width=0)
                     Index Cond: ((street.streettyp)::text = 'RD'::text)
(12 rows)

We've managed to cut our maximum cost in half, and then some. The minimum has come up again, but this is quite acceptable. Let's check our indexes so we know what we've got.

pgsql=# \di
                    List of relations
        Name         | Type  | Table  
---------------------+-------+---------
street_scid_idx      | index | street
street_streettyp_idx | index | street
subcode_code_idx     | index | subcode
subcode_scid_idx     | index | subcode
subcode_sub_idx      | index | subcode

You'll notice that we've got an index on subcode.code but it wasn't used. The engine probably figured it was cheaper to just filter rows from an already-small resultset. If you want to check, you need to know a bit more about your data.

pgsql=# SELECT COUNT(DISTINCT code) AS distinct_codes, COUNT(code) AS num_codes FROM subcode;
 distinct_codes | num_codes 
----------------+-----------
           2604 |     12550

pgsql=# SELECT COUNT(DISTINCT code) AS distinct_codes, COUNT(code) AS num_codes FROM subcode WHERE subcode.sub = 'HORNSBY HEIGHTS';
 distinct_codes | num_codes 
----------------+-----------
              1 |         1

An index on subcode.code is probably a good thing, but it's not much help after the index scan for HORNSBY HEIGHTS returns just 1 row anyway.

MySQL

MySQL unfortunately doesn't show the cost of a query, but we can see that the indexes are being used. The engine now retrieves a few dozen rows from street instead of 11,000 of them, and the order of the subqueries has flipped.

mysql> CREATE INDEX subcode_code_idx ON subcode (code);
Query OK, 12550 rows affected (0.08 sec)
Records: 12550  Duplicates: 0  Warnings: 0

mysql> EXPLAIN SELECT scid,street,streettyp,sub,code FROM subcode INNER JOIN street USING (scid) WHERE (street.street = 'GALSTON') AND (street.streettyp = 'RD') AND (subcode.sub = 'HORNSBY HEIGHTS') AND (subcode.code = '2077') ;
+----+-------------+---------+------+-----------------------------------+------------------+---------+--------------+------+-------------+
| id | select_type | table   | type | possible_keys                     | key              | key_len | ref          | rows | Extra       |
+----+-------------+---------+------+-----------------------------------+------------------+---------+--------------+------+-------------+
|  1 | SIMPLE      | subcode | ref  | subcode_scid_idx,subcode_code_idx | subcode_code_idx | 5       | const        |    3 | Using where | 
|  1 | SIMPLE      | street  | ref  | street_scid_idx                   | street_scid_idx  | 5       | subcode.scid |   44 | Using where | 
+----+-------------+---------+------+-----------------------------------+------------------+---------+--------------+------+-------------+
2 rows in set (0.00 sec)

Just to be fair, we'll add the extra indexes and see what happens.

mysql> CREATE INDEX subcode_sub_idx ON subcode (sub);
Query OK, 12550 rows affected (0.09 sec)
Records: 12550  Duplicates: 0  Warnings: 0

mysql> CREATE INDEX street_streettyp_idx ON street (streettyp);
Query OK, 11705 rows affected (0.11 sec)
Records: 11705  Duplicates: 0  Warnings: 0

mysql> EXPLAIN SELECT scid,street,streettyp,sub,code FROM subcode INNER JOIN street USING (scid) WHERE (street.street = 'GALSTON') AND (street.streettyp = 'RD') AND (subcode.sub = 'HORNSBY HEIGHTS') AND (subcode.code = '2077') ;
+----+-------------+---------+------+---------------------------------------------------+-----------------+---------+--------------+------+-------------+
| id | select_type | table   | type | possible_keys                                     | key             | key_len | ref          | rows | Extra       |
+----+-------------+---------+------+---------------------------------------------------+-----------------+---------+--------------+------+-------------+
|  1 | SIMPLE      | subcode | ref  | subcode_scid_idx,subcode_code_idx,subcode_sub_idx | subcode_sub_idx | 53      | const        |    1 | Using where | 
|  1 | SIMPLE      | street  | ref  | street_scid_idx,street_streettyp_idx              | street_scid_idx | 5       | subcode.scid |   44 | Using where | 
+----+-------------+---------+------+---------------------------------------------------+-----------------+---------+--------------+------+-------------+
2 rows in set (0.00 sec)

This improved the subquery on subcode, but it's hard to get much more information than that.

Regular maintenance

Now that we're using indexes, they need to be kept up to date if they're to effectively help the query engine. Postgres keeps a lot of statistical data about the contents of your tables and how they're used, which is fed to the query planner (and, for queries with many joins, to the Genetic Query Optimiser, GEQO). MySQL uses other table metadata to determine the efficacy of the indexes and when to use them.

Whichever database you use, the process is fairly similar. You should regularly VACUUM your tables to remove dead rows, and ANALYZE them to ensure your metadata is relevant.

Postgres

You can add VERBOSE (i.e. VACUUM VERBOSE ANALYZE) if you want comprehensive output about what the engine does.

# process one table
pgsql=# VACUUM ANALYZE tablename;

# process all tables in a database
pgsql=# VACUUM ANALYZE;

MySQL

# check one (or more) tables
mysql> ANALYZE TABLE tablename, another_table, yet_another_table;

# check all tables (from the regular command line, instead of the mysql shell; you may need the mysql root password)
$ mysqlcheck --all-databases -u root -p

Further optimisations

Relational database backends revolve around structured, typed data. Because of this, they'll work best when your data is stored in a structured manner. As an example, it's easy to store a customer's address in an arbitrary-length text field, but this isn't very efficient. You'd be better off splitting it up to separate out things like the suburb, postcode, country, etc. As a bonus, you now have more data readily accessible, allowing you to answer questions like "where are most of my customers located?".
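
As a hypothetical sketch of the structured approach (the table and column names are made up for illustration):

CREATE TABLE customer_address (
    customer_id    integer NOT NULL,
    street_address varchar(200),
    suburb         varchar(80),
    postcode       integer,
    country        varchar(80)
);

-- "where are most of my customers located?" becomes a simple aggregate
SELECT suburb, COUNT(*) AS customers
FROM customer_address
GROUP BY suburb
ORDER BY customers DESC
LIMIT 10;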

Most applications, particularly ones written in PHP, make a connection to the database for every page hit. This is fine, but it means you really want to touch the database as little as possible. Reducing the number of queries you make by aggregating results before returning them is a good way to take some load off the database. Similarly, if you can cut rows out before returning them to your application then you should do so. Selecting the contents of an entire table, then discarding all but one row when it gets to your application is silly.
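
As a contrived example (the orders table here is hypothetical):

-- wasteful: drags every row across to the application, which then counts them itself
SELECT * FROM orders;

-- better: let the database do the aggregation and return a single row
SELECT COUNT(*) FROM orders WHERE customer_id = 42;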

If you use postgres, you can CLUSTER your data on a column. By putting similar data next to each other, a SELECT that keys off that column can fetch the relevant rows in one operation, rather than having to move around on the disk (if you're curious, postgres stores and retrieves rows in 8KiB pages; putting related rows into fewer pages means less disk activity). This should produce a noticeable reduction in the costs you saw earlier when running EXPLAIN.
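
A sketch using the tables from earlier (this is the syntax for postgres 8.3 and later; older versions use CLUSTER street_scid_idx ON street):

pgsql=# CLUSTER street USING street_scid_idx;

Note that clustering is a one-off operation; rows inserted later aren't kept in order, so you may want to re-run it periodically.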

As a final point, it pays to be smart about your choice of column types and the criteria you use in your WHERE clauses. If you're working with postcodes, it's more efficient to store them as integers rather than a string of characters. If you're SELECTing from the column, a numerical greater-than or lesser-than will be far more efficient than text comparisons like regular expressions or LIKE.
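
As an illustration, imagine the code column from earlier were stored as an integer instead of characters:

-- numeric comparison on an integer column: cheap, and a btree index can serve the range directly
SELECT * FROM subcode WHERE code BETWEEN 2000 AND 2100;

-- text pattern matching: generally much more expensive row-by-row work
SELECT * FROM subcode WHERE code LIKE '20%';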

Author

Barney Desmond is a Linux systems administrator at Anchor with a passion for free software and open source solutions. Anchor is a provider of Australian web hosting and dedicated servers.

Resources