Per-domain Traffic Reporting and Usage Analysis for Web, E-mail and FTP

IP data accounting is not usually a terribly tricky feat to pull off. All traffic that traverses the Internet (or any computer network for that matter) is merely a flow of distinct packets (or datagrams) that, when reassembled at the receiver's end, form a web page or e-mail message. Each packet put out onto a network contains at least a header and a payload. The payload is simply the part of the packet that contains your data; or rather, a small fragment of it. The header is a special data structure tacked on by your server's operating system or an intermediate router that describes various properties of the packet, including where it came from and where to send it. Of particular relevance to this article, the header also contains a fixed field that describes the size of the packet.
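To make the size field concrete, here is a minimal Python sketch that reads the Total Length field out of a raw IPv4 header. The function name and the hand-built sample header are illustrative assumptions; only the field offsets come from the IPv4 header layout.

```python
import struct

def ipv4_total_length(packet: bytes) -> int:
    """Return the Total Length field (header plus payload, in bytes) of an IPv4 packet."""
    if len(packet) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes long")
    version = packet[0] >> 4
    if version != 4:
        raise ValueError(f"not an IPv4 packet (version field is {version})")
    # Total Length is a 16-bit big-endian integer at byte offset 2.
    (total_length,) = struct.unpack_from("!H", packet, 2)
    return total_length

# A hand-built 20-byte header describing a 60-byte packet (illustrative values only).
sample_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45,                       # version 4, header length of 5 x 32-bit words
    0,                          # DSCP/ECN
    60,                         # total length
    0, 0,                       # identification, flags/fragment offset
    64,                         # time to live
    6,                          # protocol (TCP)
    0,                          # checksum (left as zero in this sketch)
    bytes([192, 0, 2, 1]),      # source address
    bytes([198, 51, 100, 7]),   # destination address
)

print(ipv4_total_length(sample_header))  # prints 60
```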

Most service providers will simply capture or tap your flow of packets at their network border and use software [1] to peer into the headers to determine how much data you are pushing or pulling across their transit links. This data is then fed into huge database systems for later analysis. This is IP data accounting and is the basis of most service provider billing models.
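The bookkeeping behind this can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular accounting suite works; the function name, the tuple layout and the sample addresses are all assumptions.

```python
from collections import Counter

def account_by_address(packets, customer_addresses):
    """Sum packet sizes per customer address.

    packets is an iterable of (source, destination, size_in_bytes) tuples,
    as they might be read from captured packet headers.
    """
    totals = Counter()
    for src, dst, size in packets:
        if dst in customer_addresses:
            totals[dst] += size   # traffic pulled in by the customer
        if src in customer_addresses:
            totals[src] += size   # traffic pushed out by the customer
    return totals

# Hypothetical capture: two customers, each bound to a dedicated address.
captured = [
    ("203.0.113.9", "192.0.2.10", 1500),
    ("192.0.2.10", "203.0.113.9", 400),
    ("203.0.113.9", "192.0.2.11", 600),
]
usage = account_by_address(captured, {"192.0.2.10", "192.0.2.11"})
print(usage["192.0.2.10"], usage["192.0.2.11"])  # prints 1900 600
```

Note that the address is the only key available here, which is why this approach depends on each customer having addresses of their own.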

The problem

While IP accounting is highly accurate and reasonably kind on computing resources, it can only be applied in situations where you are able to differentiate between your customers based solely on the information present in the packet headers. In other words, each one of your customers must be bound to a dedicated set of IP addresses.

At Anchor, we routinely support resellers who have one or more managed dedicated servers hosted with us and run into this problem when it comes to billing their clients for their data usage. The traditional approach to the problem was simple: ask Anchor for a sufficiently large block of IP addresses and allocate one for each client site hosted on the reseller's dedicated server. By ensuring that all of a client's traffic hits a single IP address (this is implemented by fiddling with a server application's bind address and the site's DNS records) the standard Anchor traffic reports mailed out to our dedicated server customers can be trivially factored into a reseller's billing calculations.

Unfortunately, these days are long gone. Internet Registries like APNIC (organisations responsible for the division and allocation of the IP address space to their customers, Anchor included) have developed a low tolerance for wasteful practices such as the one described in the previous paragraph. IPv4 address exhaustion is the Internet service provider's Oil Crisis. We simply have progressively fewer of these addresses to hand out and are under constant pressure to keep unnecessary allocations to an absolute minimum.

Continuing the oil crisis analogy, network protocol architects have, for quite some time now, had measures in place to catapult our Internet into its own Green Economy. IPv6 is the next incarnation of IP (the Internet Protocol). Arguably the most anticipated improvement of IPv6 is its significantly larger address space: 2^96 times bigger than the current address space provided by IPv4, not taking the substantial amount of administratively reserved IPv4 addresses into account. In case you were wondering, 2^96 is about 80,000,000,000,000,000,000,000,000,000, or eighty octillion. But like high-capacity, efficient and cost-effective batteries, IPv6 is still some time away from widespread adoption. In the meantime, the rest of the world is reduced to cutting the emissions on their internal combustion engines and bicycling to work: Anchor, as a service provider, no longer allocates IPv4 addresses for the sole purpose of billing.
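The ratio quoted above follows directly from the address widths (32 bits for IPv4, 128 bits for IPv6) and can be checked with a couple of lines of Python:

```python
ipv4_addresses = 2 ** 32    # IPv4 addresses are 32 bits wide
ipv6_addresses = 2 ** 128   # IPv6 addresses are 128 bits wide

ratio = ipv6_addresses // ipv4_addresses
print(ratio == 2 ** 96)     # prints True
print(f"{ratio:,}")         # the full 29-digit figure, roughly 7.9 x 10^28
```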

The solution

Forget IP accounting. Yes, it's convenient and easy to implement, but in times of hardship alternative methods are often required.

All respectable server applications (whether they be serving web, mail or FTP content) are capable of logging, in real-time, what they are up to. What interests us is the ability to pluck out of these application logs figures that can be deterministically attributed to a client's total data transfer, even with all client sites hosted on a single, shared IP address. On Anchor managed Linux dedicated servers, this is not much of a stretch:

  • All of our supported server software (including, but not limited to, the Apache HTTP Server, the Postfix Mail Transfer Agent and the vsftpd FTP server) supports a high degree of real-time client-activity logging.
  • Client sites are entirely contained within their own home directories and user accounts.
  • Thus, it is reasonably trivial to tie log data to your clients based on the domain names and user account names dropped in each message of the logs. All you require is a maintainable mapping from domain name or user account to client, a mechanism to collect and store accounting data and another to collate and report on this data.
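As a rough illustration of the three pieces listed above (a mapping, collected figures and a collation step), consider the following Python sketch. The mapping contents, the record layout and the function name are hypothetical and are not taken from the scripts described later.

```python
from collections import defaultdict

# Hypothetical mapping from a domain name or system account to the client who owns it.
OWNER = {
    "example.com": "Client A",
    "alice": "Client A",
    "example.org": "Client B",
    "bob": "Client B",
}

def collate(records, mapping):
    """Total up (domain_or_account, byte_count) records per client."""
    totals = defaultdict(int)
    for key, nbytes in records:
        # Anything we cannot attribute is kept visible rather than silently dropped.
        totals[mapping.get(key, "unassigned")] += nbytes
    return dict(totals)

# Figures as they might be plucked from web, mail and FTP logs.
collected = [("example.com", 5000), ("alice", 1200), ("example.org", 800), ("carol", 50)]
print(collate(collected, OWNER))
# prints {'Client A': 6200, 'Client B': 800, 'unassigned': 50}
```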

The remainder of this article will provide some basic tools necessary for such a data accounting system and some rough guidance on their implementation. If you are an Anchor dedicated reseller, please get in touch with us to have this reporting system deployed and tested for you by one of our qualified Systems Administrators.

The disclaimer

Before we go on to discuss the software tools below, please note that application log-based traffic accounting is inaccurate by IP accounting standards. We can never hope to account for every last byte of traffic that traverses a network link simply by analysing a high-level application log. At best, this alternative method should be thought of as an approximation. The tools below will always give you a number lower than what an IP accounting suite would provide. (If factored into your billing calculations, you will not over-bill your customers when using these tools as intended.) Additionally, the tools below are unable to break traffic down into inbound and outbound components. Figures provided in reports are to be taken as a total for both inbound and outbound traffic.
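To get a feel for the size of the gap, here is a back-of-the-envelope Python estimate. It assumes 40 bytes of IPv4 and TCP headers per segment (no options) and a 1460-byte maximum segment size, and it ignores handshakes, acknowledgements and retransmissions, so the real difference is larger still. All of these figures are assumptions for illustration.

```python
import math

HEADER_BYTES = 40     # assumed: 20-byte IPv4 header + 20-byte TCP header, no options
SEGMENT_SIZE = 1460   # assumed: typical TCP maximum segment size on Ethernet

def estimated_wire_bytes(logged_bytes: int) -> int:
    """Estimate what an IP accounting suite would count for a logged transfer."""
    segments = math.ceil(logged_bytes / SEGMENT_SIZE)
    return logged_bytes + segments * HEADER_BYTES

logged = 1_000_000
print(estimated_wire_bytes(logged) - logged)  # prints 27400 (bytes the log never sees)
```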

The tools

The collection of scripts you can use to implement the accounting method discussed in this article can be downloaded as a gzip-compressed tarball from here: anchor-traffic.tar.gz

When all is said and done, you will be mailed automated reports not unlike the following (domain names removed to protect the innocent):

sample_report.png

PostgreSQL

The data acquisition scripts provided here will only inject data into a PostgreSQL database back-end. No plans currently exist to support MySQL or SQLite, although such modifications would be reasonably trivial with the use of a database abstraction layer.

Here's a visual representation of the simple database schema in use:

schema.png

init_db.sql

This file contains the SQL statements you will need to invoke against Postgres in order to initialise your new database with a suitable set of tables.

traf_add_user.pl

Most system logs store user name information only. As our traffic reports are all generated on a per-domain basis, we require a means of relating user names (or UNIX system accounts) to domain names. (Remember, we are operating under the assumption that each virtual host configured on the server has its own system account.) This script sets up such an association for new system accounts. It should be run once for every virtual hosting system account already configured on the server and once for each new system account added to the server in the future.

Usage:  traf_add_user.pl USER DOMAIN 
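The association itself is a simple one-row-per-account record. The Python sketch below shows the idea using an in-memory SQLite database purely so that it runs without a server; the real scripts target PostgreSQL, and the table and column names here are hypothetical rather than taken from init_db.sql.

```python
import sqlite3

def add_user(conn, user, domain):
    """Record (or update) which domain a system account belongs to."""
    conn.execute(
        "INSERT OR REPLACE INTO user_domain (username, domain) VALUES (?, ?)",
        (user, domain),
    )
    conn.commit()

def domain_for(conn, user):
    """Look up the domain for a system account, or None if it is unknown."""
    row = conn.execute(
        "SELECT domain FROM user_domain WHERE username = ?", (user,)
    ).fetchone()
    return row[0] if row else None

# Hypothetical stand-in schema; SQLite is used only for the sake of a runnable example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_domain (username TEXT PRIMARY KEY, domain TEXT NOT NULL)")

add_user(conn, "alice", "example.com")
print(domain_for(conn, "alice"))  # prints example.com
print(domain_for(conn, "bob"))    # prints None
```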

traf_apache.pl

Parses your Apache HTTP Server access logs and injects accounting data into the storage back-end for future reporting.

Usage:  traf_apache.pl ACCESSLOG 
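The kind of parsing involved can be sketched in Python as follows. This is not the script's actual logic: it assumes an access log written in Common Log Format with the virtual host name (%v) prepended to each line, and the function name and sample lines are made up. Because Apache's %b field counts the response body only, the totals exclude HTTP headers and request data.

```python
import re
from collections import Counter

# Assumed LogFormat: %v %h %l %u %t "%r" %>s %b
LOG_LINE = re.compile(
    r'^(?P<vhost>\S+) \S+ \S+ \S+ \[[^\]]+\] "(?:[^"\\]|\\.)*" \d{3} (?P<size>\d+|-)'
)

def bytes_per_vhost(lines):
    """Sum the response sizes logged for each virtual host."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        size = match.group("size")
        # %b logs "-" rather than 0 when no body was sent.
        totals[match.group("vhost").lower()] += 0 if size == "-" else int(size)
    return totals

sample = [
    'example.com 203.0.113.9 - - [10/Oct/2009:13:55:36 +1000] "GET /index.html HTTP/1.1" 200 2326',
    'example.com 203.0.113.9 - - [10/Oct/2009:13:55:37 +1000] "GET /logo.png HTTP/1.1" 304 -',
    'example.org 198.51.100.4 - - [10/Oct/2009:13:56:01 +1000] "GET / HTTP/1.1" 200 512',
]
totals = bytes_per_vhost(sample)
print(totals["example.com"], totals["example.org"])  # prints 2326 512
```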

traf_dovecot.pl

Parses your Dovecot mail logs and injects accounting data into the storage back-end for future reporting.

Usage:  traf_dovecot.pl MAILLOG 
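As an illustration only, the Python sketch below totals IMAP traffic per user from logout lines. The line shapes are assumptions: Dovecot's logout message is configurable (via imap_logout_format), and this sketch recognises just two shapes, "in=N out=N" and "bytes=N/N". The function name and sample lines are hypothetical, and POP3 sessions are not handled.

```python
import re
from collections import Counter

USER = re.compile(r'\bimap\((?P<user>[^)]+)\)', re.IGNORECASE)
IN_OUT = re.compile(r'\bin=(\d+) out=(\d+)')       # assumed logout shape
BYTES_PAIR = re.compile(r'\bbytes=(\d+)/(\d+)')    # assumed alternative logout shape

def imap_bytes_per_user(lines):
    """Sum bytes received from and sent to each IMAP user, as one combined total."""
    totals = Counter()
    for line in lines:
        user = USER.search(line)
        counts = IN_OUT.search(line) or BYTES_PAIR.search(line)
        if user and counts:
            totals[user.group("user")] += int(counts.group(1)) + int(counts.group(2))
    return totals

sample = [
    "Oct 10 13:55:36 host dovecot: imap(alice): Disconnected: Logged out in=421 out=10345",
    "Oct 10 14:02:11 host dovecot: IMAP(bob): Disconnected: Logged out bytes=100/2000",
    "Oct 10 14:05:00 host dovecot: imap-login: Login: user=<alice>, method=PLAIN",
]
mail_totals = imap_bytes_per_user(sample)
print(mail_totals["alice"], mail_totals["bob"])  # prints 10766 2100
```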

traf_vsftpd.pl

Parses your vsftpd transfer logs and injects accounting data into the storage back-end for future reporting.

Usage:  traf_vsftpd.pl XFERLOG 
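A minimal Python sketch of this kind of parsing is shown below. It assumes the transfer log is written in the standard wu-ftpd style xferlog layout (which vsftpd produces when xferlog_std_format is enabled), where the byte count is the eighth whitespace-separated field and the user name is the fifth field from the end. The function name and sample lines are hypothetical.

```python
from collections import Counter

def ftp_bytes_per_user(lines):
    """Sum transferred bytes per FTP user from xferlog-style lines.

    Uploads and downloads are added together into a single figure.
    """
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 18:
            continue  # too short to be a complete xferlog entry
        size, username = fields[7], fields[-5]
        if size.isdigit():
            totals[username] += int(size)
    return totals

sample = [
    "Sat Oct 10 13:55:36 2009 2 203.0.113.9 104857 /home/alice/site.tar.gz b _ o r alice ftp 0 * c",
    "Sat Oct 10 14:10:02 2009 1 203.0.113.9 2048 /home/alice/index.html a _ i r alice ftp 0 * c",
    "Sat Oct 10 15:00:00 2009 1 198.51.100.4 512 /home/bob/notes.txt a _ o r bob ftp 0 * c",
]
ftp_totals = ftp_bytes_per_user(sample)
print(ftp_totals["alice"], ftp_totals["bob"])  # prints 106905 512
```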

traf_rpt_monthly.pl

Generates a human-friendly traffic report for each domain hosted on your server. These reports are designed to be automatically generated by your server and mailed out to you on a periodic basis (usually at the turn of the month).

Usage:  traf_rpt_monthly.pl OUTPUT YYYY MM 
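The collation step behind such a report can be sketched in Python as follows. The row layout, function name and plain-text output format are assumptions for illustration; the actual script reads from the database and produces its own report format.

```python
from collections import defaultdict
from datetime import date

def monthly_report(rows, year, month):
    """Build a plain-text per-domain summary for one calendar month.

    rows is an iterable of (domain, day, byte_count) tuples.
    """
    totals = defaultdict(int)
    for domain, day, nbytes in rows:
        if day.year == year and day.month == month:
            totals[domain] += nbytes
    lines = [f"Traffic report for {year:04d}-{month:02d}", ""]
    for domain in sorted(totals):
        lines.append(f"{domain:<30} {totals[domain] / 1_000_000:>12.2f} MB")
    return "\n".join(lines)

# Hypothetical accounting rows, as they might be stored by the collection scripts.
rows = [
    ("example.com", date(2009, 10, 1), 1_500_000),
    ("example.com", date(2009, 10, 15), 500_000),
    ("example.org", date(2009, 10, 3), 250_000),
    ("example.com", date(2009, 11, 1), 999),
]
print(monthly_report(rows, 2009, 10))
```

Running it for October 2009 lists 2.00 MB for example.com and 0.25 MB for example.org; the November row is left out.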

  1. We use pmacct to satisfy our IP accounting needs.