Efficient File Storage

Everything that your website does is, at the end of the day, controlled and assisted by files on a hard drive. All of the static files that are served up to your users, the code and data that you read to make up your dynamic pages, and the user-generated data that you store, are all ultimately kept on a disk.

The problem with this is that disks are spectacularly slow, relatively speaking. We've got CPUs that can perform billions of operations every second, and even memory that can transfer gigabytes of data per second. What about the drives? They can't provide those high transfer rates, and it takes ages to travel from one end of the disk to the other to retrieve a different piece of data. What's a poor developer to do?

This article will explain (at a basic level) how filesystems work, and will describe some techniques that you can use to limit the impact of these vital but rather slow elements of your web hosting server.

The Secret Life of Files

Just as it's hard to performance-tune a car without knowing how the engine works, knowing what goes on "under the hood" of your filesystem and disks is important if you want to get the best performance out of your server.

Why the disk is slow

If you're coming at this topic cold, consider perusing the How Stuff Works article on hard disks, and for lots of in-depth info Wikipedia is always helpful.

There are a couple of reasons why the disk is slow. The first is how fast we can get data off a disk platter and into the computer, assuming it's all in a line ready to go. It's a materials problem -- you can't make a disk spin faster than a certain speed, because beyond that it isn't strong enough to hold itself together and will tear itself to pieces. That puts a limit on the transfer rates you can achieve, because you can only get so many bits of data under the disk heads per second.

The other issue is known as "seek time". We don't always want to read the contents of the disk from beginning to end. It's more like reading the dictionary, in that we jump to one entry we want to know about, then we leap somewhere else. In your disks, this jumping around is done by the disk heads, and like the platters, the heads can only move so fast. It's like flipping through the pages of the dictionary to find the entry you're looking for -- it's always going to take a certain amount of time to make that happen.

Seek time isn't so much of an issue when you're reading a big contiguous chunk of data off the disk, but your average webserver isn't doing much of that -- between different processes wanting to get at different data at the same time, and lots of requests coming in for different files, it's rare that a disk would get much of a chance to read a big chunk of data at once.

Filesystems 101

A filesystem is a means of organising and arranging data on disk, so that when you want to get a file back out again you can do so easily, just by knowing a filename. In addition, the filesystem stores useful metadata about your files, like permissions and ownership.

Different filesystems achieve this in different ways, but in general there's some sort of index that records where on the disk each file actually resides, what directory it's in, and that sort of thing.

To make life a bit easier, the disk is divided up into lots of little blocks, and all of the data is contained in these blocks. If you've got a large file, it'll be chopped up into several blocks before being written to disk. The filesystem's indexes are also kept in blocks, and they contain lists of the blocks that all of the files are in. Operations on the filesystem are always in terms of blocks -- "read a block from here", "write this data to those 3 blocks there", that sort of thing. If you're dealing in titchy little chunks of data in your application, rest assured that underneath there are some much bigger chunks of data being read and written.
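If you're curious, you can see this block-level bookkeeping from PHP. Here's a quick sketch using stat() -- the filename is just a placeholder, and on platforms that don't expose block information these fields come back as -1:

$info = stat("somefile.dat");

echo "Preferred I/O block size:  " . $info['blksize'] . " bytes\n";
echo "512-byte blocks allocated: " . $info['blocks'] . "\n";
echo "Actual file size:          " . $info['size'] . " bytes\n";

// Even a one-byte file will normally have at least one whole block allocated to it.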

The indexes and directory trees on the filesystem are stored as various sorts of lists and trees. If you've ever studied algorithms and data structures, you'll know the sorts of things that are going on behind the scenes. The problem is that there are many different possible data structures that can be used, and a large part of the job of the filesystem designer is to make tradeoffs between making a certain operation fast and making something else a bit slower, by choosing the right data structure to use for a given part of the filesystem layout. The idea is to make the common operations and usage modes fast, at the expense of things that don't get done very often.

To make your disk usage fast, you need to know what "common" and "rare" meant to the designer of the filesystem you're using. If you're using a filesystem designed for the rapid storage and access of a small number of very large files, then trying to store a couple of million tiny files isn't going to work very well, because in order to make large file access fast, the designer will have had to make compromises that slow down small files.

Disk Caching

Since disks are quite slow, we use caching to try and limit the impact of this problem. Aspects of caching include:

  • trying to keep chunks of the disk's contents in memory instead of having to re-read them off the disk every time you want access to them, on the assumption that what gets read once will likely be read more than once;
  • prefetching more of a file than you've asked for at the moment on the assumption that you'll want the rest later; and
  • storing data to be written to disk in memory for a little while before making one big monster write all at once, allowing your program to get on with its business instead of waiting for the write to complete.

All of these strategies try to use fast main memory as much as possible instead of going to the disk all the time. You should try and work with the cache, rather than against it.

What Makes for Slow Disk Operations

In general there are a few "rules of thumb" that apply to most of the filesystems that are out there:

  • The time it takes to find and read a file in a directory grows with the number of files in that directory, and beyond a certain point things get really slow;

  • Deleting a large file takes longer than deleting a small file;
  • Renaming a file or moving a file to another location on the same filesystem is a quick operation;
  • Moving a large file to a location on a different filesystem is slow (as it is, in effect, a copy and then a delete);
  • Re-reading a file lots of times in succession is a fairly fast operation, as is reading the contents of the same directory;
  • Regularly reading a large amount of the same data is likely to be slow if you're reading more data than you have cache memory available;
  • Writing lots of little chunks of data will be slower than writing large chunks;
  • Reading lots of individual small files is slower than reading a single file that has the same contents as all the little files;
  • Disks are slow; avoid them where possible.

Bearing these limitations in mind when you're working out how to store and access your files is very important.

How to speed up your filesystem accesses

Having looked at the general theory, it's time to examine how to put these principles into practice on your website.

Once and Only Once

In the same way that code duplication is considered a bad thing, having duplicate files in your tree isn't a good idea either. In other words, don't copy files from place to place when you need to have the same contents appear in multiple places on the filesystem.

The underlying problem is that, despite the files having identical contents, the filesystem doesn't know that they're the same file, so it will cache both copies and waste time reading each of them off disk separately.

Instead, use symlinks or hard links to make the same file available in multiple places. These are Unix-specific schemes (although Windows has similar-but-different concepts available) to have multiple filenames all point to the same blocks on disk. Since it's the same file contents underneath, the operating system can optimize the file accesses, only cache one copy, and save you a pile of time and hassle on commonly-accessed files.
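As a quick sketch (the paths here are invented for illustration), creating a symlink from PHP is a one-liner:

// Point a second name at the one real copy instead of duplicating the file.
$real  = "/var/www/shared/css/main.css";
$alias = "/var/www/site2/htdocs/css/main.css";

if (!file_exists($alias)) {
        symlink($real, $alias);     // or link($real, $alias) for a hard link
}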

Finish one job before you start the next

The disk cache has a limited size, and can only be of help if you re-read the same data frequently.

If you're processing a large dataset, therefore, or think you might need to, select data structures and algorithms that let you process your data in small chunks, rather than going through the whole lot from beginning to end multiple times.
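For example, rather than slurping an entire file into memory and walking over it repeatedly, stream it a line at a time. A rough sketch, with the filename and the per-line work invented for illustration:

// Process a large log file one line at a time; memory use stays tiny and
// each block only needs to be read off the disk once.
$fp = fopen("big-access.log", "r");
$notFound = 0;
while (($line = fgets($fp)) !== false) {
        if (strpos($line, " 404 ") !== false) {
                $notFound++;
        }
}
fclose($fp);
echo "404 responses seen: $notFound\n";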

A related issue to this is relational databases and indexes. If you're trying to find some records in a large table, there are two ways it can be done. Either there's an index, which maps the values you're searching for to the location of the record(s) of interest, or the database has to read every row looking for the values you want. Sound familiar? That's right: if your database has to read through the entire table, and that table is too large to fit in your cache, then every time it gets scanned, it'll be slowly read from disk. If the database can use indexes, since they're a lot smaller, they'll probably stay in cache and can be read very quickly.
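As a hedged example (the table and column names are invented), adding an index is a one-off statement, and the database will then use it for matching lookups without any further changes to your queries:

// One-off setup, e.g. in a migration script: index the column you search by.
$db = new PDO("mysql:host=localhost;dbname=shop", "user", "password");
$db->exec("CREATE INDEX idx_orders_customer ON orders (customer_id)");

// Lookups like this can now be answered from the small, cacheable index
// instead of a full scan of the table off disk.
$stmt = $db->prepare("SELECT * FROM orders WHERE customer_id = ?");
$stmt->execute(array(42));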

The moral: INDEX EARLY, AND INDEX OFTEN.

Write data right

When you need to write some data to disk, instead of writing very small chunks at a time, consider assembling it all into one big chunk of stuff to write at once. Consider the following snippet of PHP:

for ($i = 0; $i < 100; $i++) {
        $fp = fopen("powers", "a");
        fwrite($fp, $i);
        fwrite($fp, " * ");
        fwrite($fp, $i);
        fwrite($fp, " = ");
        fwrite($fp, $i*$i);
        fwrite($fp, "\n");
        fclose($fp);
}

This piece of code will issue 600 write calls, each for a tiny amount of data, and will open and close the file 100 times. While the tiny writes might get optimised together into a larger write (if you're lucky), every time you close the file its buffers are flushed and the data gets pushed down towards the disk, and that flush costs time, so with the above code you're paying for a trip to the filesystem at least 100 times.

Instead, consider the following snippet:

$fp = fopen("powers", "w");
for ($i = 0; $i < 100; $i++) {
        fwrite($fp, sprintf("%d * %d = %d\n", $i, $i, $i * $i));
}
fclose($fp);

This only writes 100 times, and only closes the file once. With a bit of luck, some of these writes will be cached together, and there will be significant happiness. However, we can go one better, like so:

$out = "";
for ($i = 0; $i < 100; $i++) {
        $out .= sprintf("%d * %d = %d\n", $i, $i, $i * $i);
}
$fp = fopen("powers", "w");
fwrite($fp, $out);
fclose($fp);

This only writes once, and so it's guaranteed to use the least amount of disk I/O. There is a trade-off here between the CPU time and memory used to assemble the big string versus the decreased write time to the disk, but what's cheaper: a bit more RAM and maybe a CPU upgrade, or a disk that's so blazingly fast its write times are comparable to those of memory?
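As an aside, if you're on PHP 5 or later, file_put_contents() rolls the open, write and close into a single call, so the fopen()/fwrite()/fclose() at the end can be collapsed into one line:

// Same effect as the fopen()/fwrite()/fclose() above: one open, one write, one close.
file_put_contents("powers", $out);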

On a related topic, you should try and append data to the end of a file as much as possible, rather than rewriting its contents. For various reasons, changing data in the middle of a file tends to "cost" more in performance terms than just dumping a bit more data at the end. So choose data structures and algorithms that make it easy to store most of your data by appending, rather than by modifying it in the middle.
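For instance, an activity log is a natural fit for append-only writes (the filename and message are invented):

// Append-only logging: new entries are cheap to tack onto the end of the file.
$entry = date("c") . " user 42 logged in\n";
$fp = fopen("activity.log", "a");   // "a" opens the file for appending
fwrite($fp, $entry);
fclose($fp);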

Keep directory sizes small

This is a non-intuitive optimisation, but it's one that can provide some huge boosts in performance in the right circumstances.

Let's say that you store images of all your products on disk, so that they can be retrieved directly by the web server. The image files are all PNGs, and are named after the product code. So, on your web server, you've got something like this:

DOCUMENT_ROOT/
  images/
    products/
      XYZZY1.png
      ABC-BLACK.png
      ABC-WHITE.png
      ABC-GREY.png
      WOMBAT.png
      WORKWEAR-BLUE.png
      WORKWEAR-BLACK.png
      ...

This works fantastically well -- in your HTML, you just put <img src="/images/products/productcode.png"> (with the real product code in place of productcode), and the image appears nicely in the browser.

Then, one day, a user reports that your images are loading slowly. A quick bit of benchmarking shows that it's only the product images that are slow; the other images, like your logo and background, are coming through blazingly fast, as always.

It turns out that you've got so many products that when the web server tries to get the image file for a product, it's taking quite a while for the operating system to wade through all the files in the directory and retrieve the one that you want.

The trick to employ here is called bucketing or hashing: instead of having all the files in one directory, you make a bunch of subdirectories, each of which contains a small portion of the files. So, in this case, you could take the first two letters of the product code, make a directory named after each two-letter prefix, and then put all of the images that start with those two letters into that directory, like so:

DOCUMENT_ROOT/
  images/
    products/
      XY/
        XYZZY1.png
      AB/
        ABC-BLACK.png
        ABC-WHITE.png
        ABC-GREY.png
      WO/
        WOMBAT.png
        WORKWEAR-BLUE.png
        WORKWEAR-BLACK.png
      ...

What this does is limit the number of files in each directory. Having lots of files overall is rarely a problem; it's when they all congregate in the same directory that the problems occur, as the filesystem needs to scan through lots of entries to find the file you're looking for.

If you've got so many products that one or more of your subdirectories is getting slow, you can just spread things out a bit more by creating a second level:

DOCUMENT_ROOT/
  images/
    products/
      XY/
        ZZ/
          XYZZY1.png
      AB/
        C-/
          ABC-BLACK.png
          ABC-WHITE.png
          ABC-GREY.png
      WO/
        MB/
          WOMBAT.png
        RK/
          WORKWEAR-BLUE.png
          WORKWEAR-BLACK.png
      ...

The key point is that you should choose a way of splitting up your files into directories that:

  • Ensures there's never more than about 1,000 files and directories in a single directory; and
  • You can construct the whole directory tree to find a file without having to know anything other than its name.

This second point is particularly important -- if you need to look up a filename in some index to find out where it lives, then unless you're really careful, you've probably just moved your scaling problem from the filesystem to your index. It's far easier just to break the filename up directly.
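To make that concrete, here's a hedged sketch (the function name and base path are invented) of deriving the bucketed path from nothing but the product code, using the same two-plus-two character split as the trees above:

// Work out where a product image lives purely from its code, creating the
// bucket directories on the way if they don't exist yet.
function product_image_path($code)
{
        $base = $_SERVER['DOCUMENT_ROOT'] . "/images/products";
        $dir  = $base . "/" . substr($code, 0, 2) . "/" . substr($code, 2, 2);

        if (!is_dir($dir)) {
                mkdir($dir, 0755, true);    // the third argument creates parent directories
        }

        return $dir . "/" . $code . ".png";
}

// e.g. product_image_path("WOMBAT") gives .../images/products/WO/MB/WOMBAT.png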

With a bit of mod_rewrite trickery, you can limit the amount of code you have to change to the part of your system that writes the files to disk (something along the lines of the sketch above). All that code needs to do is create the directories to put the files into, if they don't already exist. Then you add the following magic to your Apache config, and you're away:

<IfModule mod_rewrite.c>
  RewriteEngine On

  RewriteRule ^/images/products/(..)(..)(.*)\.png$ /images/products/$1/$2/$1$2$3.png [L]
</IfModule>

This will take requests to the old URLs, /images/products/productcode.png, and access the file locally under the directory tree that you've defined.

This solution can get a bit tricky if you've got product codes that are shorter than the number of characters you're using to split up the files, but it's not a particularly common problem, and it can be worked around fairly easily.
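One possible workaround (sketched here with an arbitrary padding character) is to pad short codes out to a minimum length before bucketing, and to use the padded name both when saving the file and when generating its URL:

// Pad short codes so there are always enough characters to split into buckets.
// The padded name must be used consistently on both the write and read sides.
function bucket_name($code, $minlen = 4)
{
        return str_pad($code, $minlen, "_");
}

// e.g. bucket_name("XY") gives "XY__", which buckets into XY/__/XY__.png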

Avoid using the filesystem

If the disk is slow, why not just avoid using it altogether? While this isn't practical for everything, it's surprising how often you can achieve it with a bit of lateral thinking.

The most common use of disk that you can get rid of is the storage of cached and transient data. Sessions are nearly universal in web applications, and they're frequently modified. Similarly, many CMS and templating systems generate content and cache it to disk, to be accessed later. While it's more efficient to pre-calculate and cache than it is to recalculate everything every time it's requested, on a busy site your disks can end up being saturated with all the writes, which is a death-knell for performance.

Probably the most common tool for caching this data somewhere other than the filesystem is memcached, a simple daemon that is like a big hash table that you stick data into for later retrieval. As the name suggests, it's all stored in memory, and as such is going to be a darn sight quicker than putting it all on disk.
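A minimal sketch of the usual pattern (assuming the PHP Memcached extension and a daemon on localhost; the key name and the page-building function are invented):

// Look for a pre-built fragment in memcached before rebuilding it; nothing
// in this path ever touches the disk.
$cache = new Memcached();
$cache->addServer("localhost", 11211);

$html = $cache->get("homepage:rendered");
if ($html === false) {
        $html = build_homepage();                      // hypothetical, expensive to generate
        $cache->set("homepage:rendered", $html, 300);  // cache it for five minutes
}
echo $html;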

When trying to avoid using the filesystem, though, be wary of "hidden" filesystem use. It's easy to forget that things like databases use the disk too. Just because you're not calling write() yourself doesn't mean the disk isn't being hammered.