GitHub: Designing for Success

September 28, 2009 | Technical

At Anchor we do not believe in black box solutions. Sharing is caring, and we like to share. In this post we specifically want to share our triumph with Project StarBug, better known to the wider world as GitHub. For the uninitiated, GitHub is ‘Social Networking meets Source Code management’, or in GitHub’s own words: ‘Git is a fast, efficient, distributed version control system ideal for the collaborative development of software. GitHub is the easiest (and prettiest) way to participate in that collaboration: fork projects, send pull requests, monitor development, all with ease.’

Some readers may protest at this point, noting that GitHub is hosted in the USA while Anchor is located in Australia. How, then, has Anchor architected and implemented GitHub’s infrastructure, and how will it manage it going forward, with such a geographical encumbrance?

All will be revealed in a blog series of many parts:

Part 1: (This Post) Designing for success (Otherwise known as: Making GitHub’s dream a reality and nightmares a thing of the past)

Part 2: Speed matters

Part N: (To be announced)

For obvious reasons, we cannot expose GitHub’s architecture in full; however, we are sharing some of the more interesting technologies and architecture we have implemented, and the rationale for doing so. Essentially, this is what we have done to make GitHub’s dreams a reality.

Geographical encumbrance

It is a credit to GitHub’s management that they were willing to look the world over for the right team to support them. While they do not want to be harried by anything outside the GitHub application (e.g. hardware, operating system, management), they still need to ensure that the right company is employed to look after these components.

Why Anchor? Anchor’s flexibility in managing a solution on third-party hosted hardware (anywhere in the world), and its versatility in developing an architecture to suit this scenario, were part of the rationale. Anchor’s reputation for needing to know how technology works (again, no black boxes), and then working out how to improve it, was a major contributing factor.

Enough fluff; now to the meat.

One can imagine that the architecture required to support GitHub is a complex mix. We won’t lie; there are many moving parts. Some of the key criteria for designing the solution included:

Scalability

GitHub states its growth as “400 new users and 1000 new repositories every day”. Post-migration, GitHub will be running on infrastructure spread across 15+ physical hosts/servers. It is essential that the infrastructure can grow with the user base, from tens to hundreds of servers, without the need to re-architect everything. Without a doubt, growing without the associated pain is a major objective for GitHub as it moves forward.

Interesting Note: GitHub’s new physical infrastructure (at migration) consists of:

  • 15+ physical servers
  • 10+ virtual servers
  • 128 physical processor cores
  • Over 288GB of RAM
  • 1TB+ of storage

GitHub’s software architecture is modular by nature and scalability-friendly. Components outside the core software, however, were not as readily scalable. Scalability for these components has been achieved with the following improvements:

  • Distributed Storage Architecture (with real-time slaves). Distributing GitHub’s source code repos across multiple partitions and multiple nodes (including redundant slaves) provided improvements in performance, scalability and reliability. By removing the limitation of using a single filesystem volume for storage, the issue of dealing with large-scale storage has been avoided. New partitions can be rapidly added on demand with little to no fuss; a rough sketch of the routing idea follows the diagram below.

The graphic below illustrates a simplified request to the distributed file storage repo:

GitHub Distributed Repo Storage
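
To make the routing step concrete, here is a minimal, purely illustrative Python sketch of the idea: look up which partition holds a repository, then pick that partition’s active node (or its real-time slave, if you want to offload reads). The table layout, the hostnames and the names ROUTING_TABLE, REPO_PARTITION and pick_node are our own invention for this post, not GitHub’s actual code.

    # Illustrative only: repo -> partition -> node routing for distributed storage.
    ROUTING_TABLE = {
        # partition -> (active node, real-time slave)
        "fs0": ("fileserver0a", "fileserver0b"),
        "fs1": ("fileserver1a", "fileserver1b"),
    }

    # In practice this mapping would live somewhere queryable, so repositories
    # can be rebalanced and new partitions added without touching application code.
    REPO_PARTITION = {
        "github/github": "fs0",
        "rails/rails": "fs1",
    }

    def pick_node(repo, prefer_slave=False):
        """Return the host that should serve a request for the given repo."""
        partition = REPO_PARTITION[repo]
        active, slave = ROUTING_TABLE[partition]
        return slave if prefer_slave else active

    # Growing the storage pool is simply a matter of registering a new partition:
    ROUTING_TABLE["fs2"] = ("fileserver2a", "fileserver2b")

    print(pick_node("rails/rails"))  # -> fileserver1a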

  • (Sensible) Virtualisation. Previously, GitHub’s infrastructure was entirely virtualised. While virtualisation has its merits, there are workloads for which it is best avoided. Services that aren’t I/O-heavy can be virtualised, while components with high I/O requirements run on dedicated (“bare metal”) servers. For GitHub, this means file storage and databases are not virtualised; otherwise, virtualisation is used to provide a mix of server consolidation, rapid deployment and service redundancy/HA.
  • Horizontal scalability (on-demand, via automated build infrastructure). The ability to add components to the infrastructure in an automated fashion reduces scale-out time and removes user error from builds and configuration. It also turns the server build/deployment procedure into a measurable deliverable that can be reviewed and improved over time (thank you, W. Edwards Deming). A toy sketch of the idea follows this list.
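
To give a flavour of what automated, repeatable builds mean in practice, here is a toy declare-and-converge sketch in Python. Configuration management tools (we mention Puppet in the comments below) work on this principle: describe the desired state of a host and apply it idempotently, so the same definition builds the first server or the fiftieth. The resources and paths here are invented for illustration, not our actual tooling.

    # Illustrative only: declare the desired state, converge towards it, report.
    import os

    def ensure_directory(path, mode=0o755):
        """Create a directory only if it is missing."""
        if os.path.isdir(path):
            return "%s: already in desired state" % path
        os.makedirs(path, mode)
        return "%s: created" % path

    def ensure_file(path, contents):
        """Write a file only if its contents differ from the desired state."""
        if os.path.exists(path):
            with open(path) as fh:
                if fh.read() == contents:
                    return "%s: already in desired state" % path
        with open(path, "w") as fh:
            fh.write(contents)
        return "%s: updated" % path

    def build_node():
        """The same definition applies cleanly to a new node or an existing one."""
        for line in (
            ensure_directory("/srv/repos"),
            ensure_file("/etc/motd", "Managed node - do not hand-edit\n"),
        ):
            print(line)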

Reliability

As with most businesses, High Availability (or business continuance) is essential to success. To achieve this, a combination of DRBD, virtualisation, heartbeat and load balancing has been employed.

  • Mirroring Data; DRBD is utilised for two purposes.
  1. It is used to ensure the redundant (read: slave) storage partitions and nodes are in sync with their active counterparts.
  2. DRBD is also key in providing HA functionality across the virtualised environment.

Several Xen hosts are deployed in the following scenario: Server 1 runs VM A (active), VM B (active), VM C (offline DRBD mirror) and VM D (offline DRBD mirror), while Server 2 runs VM A (offline DRBD mirror), VM B (offline DRBD mirror), VM C (active) and VM D (active). This provides active failover if either of the virtualisation hosts fails.
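
As a hypothetical model of that placement (the host and VM names here are ours, purely for illustration), each VM has an active home plus a standby host holding its offline DRBD mirror; if a host dies, its VMs are brought up from their mirrors on the survivor:

    # Illustrative only: active/standby VM placement across a pair of Xen hosts.
    PLACEMENT = {
        # VM -> (active host, standby host holding the offline DRBD mirror)
        "A": ("xen1", "xen2"),
        "B": ("xen1", "xen2"),
        "C": ("xen2", "xen1"),
        "D": ("xen2", "xen1"),
    }

    def active_vms(host):
        """VMs that normally run on this host."""
        return sorted(vm for vm, (active, _) in PLACEMENT.items() if active == host)

    def failover_plan(dead_host):
        """Where each affected VM is started from its DRBD mirror if a host dies."""
        return {vm: standby
                for vm, (active, standby) in PLACEMENT.items()
                if active == dead_host}

    print(active_vms("xen1"))      # ['A', 'B']
    print(failover_plan("xen1"))   # {'A': 'xen2', 'B': 'xen2'}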

The graphic below illustrates the replicated, highly-available storage architecture:

GitHub Storage HA/Replication

  • Consistency; via automated builds and configuration management. With any horizontally-scaled solution, consistency amongst similar components is essential. One of the most notable achievements across the entire architecture is the complete integration of automated build infrastructure. A new/additional component of the solution can be rapidly built and added to the overall system regardless of the architecture (physical or virtual).
  • Redundancy; A simple way to ensure greater uptime and lower the risk of service interruption is to introduce as much redundancy as possible. GitHub is a great example of this practice. Data links, Ethernet switching, servers and their components all have a redundant twin ready to swing into action should the primary fail; a minimal health-check sketch follows this list.
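
To make “swing into action” concrete, here is the sort of liveness probe a heartbeat/monitoring layer performs on a redundant pair. This is a rough sketch only; the port, the example hostnames and the decision function are invented, and real tooling would also handle promotion, fencing and alerting.

    # Illustrative only: crude liveness probing of a redundant pair.
    import socket

    def is_alive(host, port=22, timeout=2.0):
        """Can we open a TCP connection to the host within the timeout?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def serving_member(primary, standby):
        """Decide which member of a redundant pair should be serving traffic."""
        if is_alive(primary):
            return primary
        if is_alive(standby):
            return standby   # real tooling would also promote/repoint here
        raise RuntimeError("both %s and %s are unreachable" % (primary, standby))

    # e.g. serving_member("lb0.example.com", "lb1.example.com")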

Conclusions

The implementation of any new architecture for an already mature product is never easy. Anchor engineers have been working tirelessly with GitHub staff to ensure that any growing pains are transparent to users. In the next entry, we will be sharing some of our insights into migrating GitHub from their existing host and infrastructure to the new Anchor-developed model. Until then, we hope you enjoy the new, faster GitHub, more of the time (well, all of the time) than ever before.

  • http://junglist.gen.nz fujin

    Very nice. Wanted to know if you guys were using any configuration management software for horizontal deployment?

  • keiran

Yes, we’ve primarily used Puppet for our configuration management.

    This will allow us to rapidly deploy any additional machines as they are required to scale out the infrastructure.

  • http://andrew.tj.id.au/ Andrew TJ

    I’m curious as to whether Anchor will be doing a move away from everydns.net for GitHub’s DNS needs? I use everydns myself but I’m not sure they’re as bullet-proof as the infrastructure they point to on GitHub’s behalf.

  • matt

    Hi Andrew,

    That’s Github’s call, not ours — although Github’s DNS is extra-redundant now as we’re secondarying for them, so it should be extra reliable. Multicontinental DNS service ftw!

  • Pingback: Github forks their sysadmins! | Anchor Web Hosting Blog
