Handling Massive Filesystems
Spinning platter-based storage has grown to sizes we never once dreamed of. I still remember my first 286-based computer with a 20MB hard drive, which seemed impressive at the time and could easily hold the hundreds of WordPerfect documents I had created. Now it is quite easy to find multiple terabytes within even a lowly desktop machine.
When you are handling storage subsystems of this capacity, the systems administration game changes slightly. Most of us who need this kind of storage will be running 64-bit systems, so we no longer need to worry about 32-bit limitations, but a lot of the tools and supporting systems aren't designed for storage of such magnitude. This article aims to present a few of the items you'll need to start thinking about when dealing with multiple-terabyte data storage subsystems.
If you are building a system for large volumes of data storage there are several issues you still need to deal with:
- Size of drives vs Number of spindles
- If you are building a large capacity storage system you probably have a set amount in TB/TiB which you need. How much I/O capacity will you need to service this amount of storage? Even if it were possible to have 20TB on a single drive at current read/write speeds, it would probably not be that useful, as a single drive still has relatively little I/O capacity. Weigh your storage requirements against your I/O requirements: the same total capacity can be built from a few large drives or from many smaller ones. The more physical disks you have, the higher your I/O capacity.
- Platter rotation speed
- Over 10 or more drives, an increase of rotation speed from 7200rpm to 10000rpm, or 10000rpm to 15000rpm will be extremely significant. If you can afford it and you need good I/O performance, investigate high speed drives.
- Cutting Edge technology
- The latest and largest disks may not have received sufficient in-the-wild testing to eradicate hardware bugs, as evidenced by a number of recent high-profile failures of high-capacity drives from several manufacturers. Bear in mind the risks associated with running cutting edge hardware.
- The top-end of any hardware range will typically cost you up to twice as much as the next model down, with only small gains. Do you really need to get that 10% extra, or would the money be better spent on more drives which cost less, for an aggregate gain?
- Use Hardware RAID
- We are big proponents of software RAID at Anchor, but for large amounts of storage you need better management facilities. Hardware RAID will offer you better performance when your storage subsystem is very large. You can also add a large write cache with battery backup, which is essential for heavy write loads involving a lot of seeks.
- Make sure the RAID solution you have is flexible enough to satisfy your current and future requirements. It is too easy to purchase hardware that later burns you because it cannot support something as simple as passing SMART data through to the OS, or controlling the write-back cache on individual drives.
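To put rough numbers on the spindle count and rotation speed trade-offs above, here is a minimal sketch. It uses the common rule of thumb that one random I/O costs an average seek plus half a platter rotation. The seek times in `ASSUMED_SEEK_MS` are illustrative assumptions rather than measurements, and the model ignores RAID overhead, caching and sequential access, so treat the output as a ballpark only.

```python
def drive_iops(rpm, avg_seek_ms):
    """Rough random I/O operations per second for one spinning drive."""
    # On average the platter must turn half a revolution before the
    # requested sector passes under the head.
    rotational_latency_ms = 0.5 * 60_000 / rpm
    return 1000 / (avg_seek_ms + rotational_latency_ms)


def array_iops(drive_count, rpm, avg_seek_ms):
    """Aggregate random-read IOPS, ignoring RAID overhead and caching."""
    return drive_count * drive_iops(rpm, avg_seek_ms)


# Assumed average seek times in milliseconds (hypothetical values);
# substitute the figures from your own drives' datasheets.
ASSUMED_SEEK_MS = {7200: 8.5, 10000: 4.5, 15000: 3.5}

if __name__ == "__main__":
    for rpm, seek_ms in ASSUMED_SEEK_MS.items():
        single = drive_iops(rpm, seek_ms)
        ten = array_iops(10, rpm, seek_ms)
        print(f"{rpm:>5} rpm: 1 drive ~{single:.0f} IOPS, 10 drives ~{ten:.0f} IOPS")
```

Because the per-drive figure is multiplied by the number of spindles, a small per-drive gain from faster rotation adds up across a large array, which is why both drive count and rotation speed matter.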
RAID levels have been documented to death. We will avoid duplicating the same work again here, but we have an informative article which describes the most common RAID levels.
There are some limitations on Linux that you will need to be aware of when dealing with very large disks:
- Standard DOS partition tables (which is what most systems will be using) can only handle disks up to 2TB. Therefore if you have set up your massive RAID to be anything larger than this, the partition table will need to be in the GPT format.
- The GRUB bootloader that most Linux systems use does not currently support GPT partition tables. A later release of GRUB that does support GPT is in the works, but it requires a feature known as EFI. For the moment, it is easiest to boot from a disk smaller than 2TB and keep the remaining data on a larger logical disk.
- The fdisk utility installed on most Linux systems does not support GPT partition tables, so you will need to use a utility called parted to create partitions on your large logical disks.
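The 2TB figure comes from the DOS partition table storing sector addresses as 32-bit values, which with traditional 512-byte sectors gives 2^32 x 512 bytes = 2TiB. The sketch below checks whether a disk of a given size needs GPT, and builds (but does not run) the parted commands to label a disk as GPT with a single partition spanning it. The device name `/dev/sdb` and the helper names are hypothetical, and the 512-byte sector size is an assumption. Note that `mklabel` destroys any existing partition table, so only run such commands against a disk you intend to wipe.

```python
SECTOR_BYTES = 512         # assumed traditional logical sector size
MBR_MAX_SECTORS = 2 ** 32  # DOS partition tables use 32-bit sector values
MBR_MAX_BYTES = SECTOR_BYTES * MBR_MAX_SECTORS  # 2 TiB


def needs_gpt(disk_bytes, sector_bytes=SECTOR_BYTES):
    """True if the disk is too large for a DOS partition table to address."""
    return disk_bytes > sector_bytes * MBR_MAX_SECTORS


def parted_commands(device):
    """Return parted invocations for a GPT label plus one full-size partition.

    The commands are only built here, never executed.
    """
    return [
        ["parted", "-s", device, "mklabel", "gpt"],
        ["parted", "-s", device, "mkpart", "primary", "0%", "100%"],
    ]


if __name__ == "__main__":
    disk_bytes = 6 * 10 ** 12  # a hypothetical 6TB logical RAID disk
    if needs_gpt(disk_bytes):
        for command in parted_commands("/dev/sdb"):  # hypothetical device
            print(" ".join(command))
```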
As mentioned previously, you should check that your hardware RAID solution is flexible enough to work around these sorts of problems. For example, if it does not allow you to create multiple arrays per disk, you will need to allocate at least two disks to a completely separate RAID1 volume to have a redundant boot volume. This assumes your disks are smaller than 2TB. Once disks larger than 2TB are available, multiple arrays per disk will become a necessity to avoid permanently wasting space on the disks.
Hopefully by that time, standard Linux utilities in the major supported distributions such as RHEL and SLES will support GPT partition tables and disks larger than 2TB.
Recent versions of Windows Server do not encounter the same problems, as far as we know.
Filesystems are often an issue of much contention, but rightly so. You need to select the most appropriate filesystem not only for your application, but also bearing in mind limitations of the filesystems available and real-world limitations of the hardware:
- With a default block size of 4KB, Ext3 can only support 2^31-1 (around 2 billion) blocks, which equates to about 8TB of storage. To make use of the full 16TB you will need to use 8KB blocks; however, standard i386/x86_64 architectures only support a 4KB page size, and the block size cannot exceed the page size. As far as we know, Alpha is the only architecture to support an 8KB page size, so most of us are stuck with 8TB as the Ext3 maximum.
- Ext4 supports filesystems up to 1EB, but is not integrated yet into most of the commercially supported distributions of Linux.
- ReiserFS supports filesystems up to 16TB, but is also not typically integrated into most commercially supported distributions of Linux. CentOS users can add the CentOS-Plus repository to gain a ReiserFS-enabled kernel and the reiserfs-utils package.
- Maximum file size is another common element you will need to be aware of for your chosen filesystem. It is rare that applications push the limits of maximum file size but be aware of any limitations if your application will need to use very large files.
- Consider the structure of your data. If you will be accessing a very large number of files, you will need a filesystem that handles this scenario efficiently. ReiserFS is a typical example of a filesystem that is tuned for large filesystem structures containing a lot of files.
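The Ext3 size limits quoted above are simply the maximum block count multiplied by the block size. A minimal sketch of that arithmetic follows, taking the 2^31-1 block limit from the text as given; the function names are our own.

```python
TIB = 2 ** 40
EXT3_MAX_BLOCKS = 2 ** 31 - 1  # block-count limit quoted in the text above


def ext3_max_bytes(block_bytes):
    """Largest Ext3 filesystem, in bytes, for a given block size."""
    return EXT3_MAX_BLOCKS * block_bytes


def fits_on_ext3(capacity_bytes, block_bytes=4096):
    """True if a volume of this size fits in one Ext3 filesystem."""
    return capacity_bytes <= ext3_max_bytes(block_bytes)


if __name__ == "__main__":
    for block_bytes in (4096, 8192):
        limit_tib = ext3_max_bytes(block_bytes) / TIB
        print(f"{block_bytes // 1024}KB blocks: about {limit_tib:.0f}TiB maximum")
```

A 10TiB volume, for example, passes the check with 8KB blocks but not with the 4KB blocks available on i386/x86_64, so on those architectures it would have to be split or placed on a different filesystem.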
One frequently neglected aspect of large filesystems is the time to repair. Any one of the above filesystems may satisfy your operational goals, but if there is a system crash and the filesystem needs to be checked, many filesystems will take an unacceptably long time to check. There are some strategies to deal with this:
- Split up your filesystems into smaller, manageable chunks. This helps alleviate filesystem size limitations, and will also cut down the time required to check any one of the individual filesystems.
- Consider a next-generation filesystem such as Btrfs, which has online filesystem checking and very fast offline filesystem checking. The latter works by dividing the overall filesystem into smaller chunks which have their own self-contained consistency markers.
If you are in some doubt as to what will be best for your application, it is wise to try benchmarking the different configuration options you have shortlisted. Once you have the hardware it is usually quite easy to play around with different RAID levels and filesystems in order to get a feel for what will work best for you. We have recently added an article on I/O Benchmarking which should help you in this regard.