Posts Tagged ‘windows’

Automated server updates

Wednesday, March 10th, 2010

This is going to be a contentious one, but here at Anchor we think automatically applying updates to servers is a Good Thing. It’s definitely not for everyone, but in an environment like ours with hundreds of managed servers it’s the only way you’re going to get things done and get any sleep at night.

Sysadmin of note Tom Limoncelli advocates testing updates first and then rolling them out to progressively more machines, which limits the scope of potential problems (it’s called “one, some, many”). It’s certainly a good strategy for a large number of homogeneous computers, but what we’re talking about here is a bit smaller-scale.
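As a rough illustration of “one, some, many”, the sketch below splits a host list into three waves. The name `rollout_waves` and the `some` parameter are made up for this example; it shows the shape of the idea, not anyone’s actual tooling.

```python
def rollout_waves(hosts, some=5):
    """Split hosts into three waves: one machine, a few more, then the rest."""
    hosts = list(hosts)
    one = hosts[:1]
    few = hosts[1:1 + some]
    many = hosts[1 + some:]
    # Drop empty waves so a short host list still yields a sensible plan.
    return [wave for wave in (one, few, many) if wave]
```

You would update each wave in turn and check that nothing broke before moving on to the next.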

Specifically, we have customers with servers that we never touch; we call this Anchor Monitor. These customers often have particular environments that they’re better off supporting themselves, so we monitor the machine to ensure it’s still on the network, and leave it at that. Unfortunately they’re not always kept up to date, so one of the more recent improvements to our process has been to enable automatic updating by default – it’s up to the customer if they want to change this once it’s handed over to them.

We’ve written this into a short procedure if you’re interested. It applies directly to Debian and Red Hat distributions, but it’s easily portable to other systems. If you run Windows, it’ll already be hassling you every 20 minutes for updates. :)
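The procedure itself isn’t reproduced here, so the following is only a minimal sketch of the general idea: work out which package manager is present and return the commands an unattended job would run. The function name and the hard-coded paths are assumptions for illustration; `apt-get update`, `apt-get -y upgrade` and `yum -y update` are the standard invocations.

```python
import os

def update_commands(has_apt=None, has_yum=None):
    """Return the unattended update commands for this distribution family."""
    # Probe the filesystem unless the caller states what is installed.
    if has_apt is None:
        has_apt = os.path.exists("/usr/bin/apt-get")  # assumed path
    if has_yum is None:
        has_yum = os.path.exists("/usr/bin/yum")  # assumed path
    if has_apt:
        # Debian family: refresh the package lists, then upgrade.
        return [["apt-get", "update"], ["apt-get", "-y", "upgrade"]]
    if has_yum:
        # Red Hat family: a single command does both steps.
        return [["yum", "-y", "update"]]
    raise RuntimeError("no supported package manager found")
```

A nightly cron job could then pass each returned command to `subprocess.call` and mail out any failures.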

When HA won’t play the way you want it to

Tuesday, September 8th, 2009

In an ideal world every service would support High Availability and Load Balancing, would scale up easily and cleanly and all of us systems administrators would be paid bucketloads to play golf all day while the computers did all the hard work. To quote Dylan Moran of Black Books fame, “Don’t make me laugh…bitterly”.

I’ll cut to the chase – sometimes you have to really shoehorn technologies to do what you want. Fortunately I love doing this, and the technologies of today’s article are virtualised Windows 2008 on Xen, and Oracle XE 10g. Neither likes to play ball, for a few reasons:

  • Generally speaking, when you virtualise an OS you want to have para-virtualisation drivers enhancing the hardware support. Open Source Xen has PV drivers, but they are not signed with a legitimate certificate. Windows 2008 does not play nicely with unsigned or test-cert-signed drivers.
  • Oracle is just a messy, messy, nasty thing. Yes, paid versions undoubtedly support all manner of load-balancing and HA options, but the free one does not.

Adding HA to Windows 2008 on Xen

The basic procedure was as follows:

  • Install the telnet server within Windows (making sure to lock it down in the firewall to only be accessible by the host machines)
  • Create a special admin account and password used for triggering a shutdown
  • Create an Expect script which logs into the VM via telnet, and issues the shutdown command
  • Create a modified version of the Heartbeat Xen resource agent which calls the Expect script to shut down the VM (and waits a safe period of time) before “xm shutdown” is called. Without this, “xm shutdown” will simply power off the VM (in the absence of working PV drivers).

The VM was already running on a DRBD volume between the two HA Xen servers, so I was able to just create a standard set of Heartbeat resources to control DRBD primary/secondary mode and the startup/shutdown of the HA Windows VM. For your benefit (if you want to recreate it) here is the Expect script:

#!/usr/bin/expect -f
#
# Script which "automates" shutting down a Windows VM

# Don't log telnet output and commands to stdout, and set a reasonable timeout.
log_user 0
set timeout 3

# Log in via telnet and issue commands. Fairly straightforward.
spawn -noecho /usr/bin/telnet 192.168.1.1
sleep 0.5

# login as the "shutdown" user
expect {
 -re "login: $" {send "shutdown\r"}
 timeout exit
}
sleep 0.5
expect {
 -re "password: $" {send "mysecretpassword\r"}
 timeout exit
}
sleep 0.5
expect {
 -re ">$" {send "shutdown /s /t 0\r"}
 timeout exit
}
sleep 0.1
expect {
 -re ">$" {send "exit\r"}
 timeout exit
}
exit

The rest is fairly self-explanatory if you understand Heartbeat.
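If you’re less familiar with Heartbeat, the stop sequence described above can be sketched roughly as follows. This is not the modified resource agent itself (a real agent is a shell script); it is a Python illustration of the ordering. The function name, the default timings and the idea of polling “xm list” instead of sleeping for a fixed period are all assumptions made for this example. It relies on “xm list <domain>” exiting non-zero once the domain is gone.

```python
import subprocess
import time

def graceful_stop(domain, expect_script, grace_seconds=120, poll_interval=5,
                  run=subprocess.call, sleep=time.sleep):
    """Ask Windows to shut itself down, wait, then fall back to 'xm shutdown'."""
    # Step 1: the Expect script logs in over telnet and runs the shutdown command.
    run([expect_script])

    # Step 2: give the guest time to go away, checking every poll_interval seconds.
    waited = 0
    while waited < grace_seconds:
        # Assumption: a non-zero exit status means the domain no longer exists.
        if run(["xm", "list", domain]) != 0:
            return "clean"
        sleep(poll_interval)
        waited += poll_interval

    # Step 3: the guest is still up, so fall back to 'xm shutdown', which
    # amounts to a power-off without working PV drivers.
    run(["xm", "shutdown", domain])
    return "forced"
```

The `run` and `sleep` parameters exist only so the logic can be exercised without a Xen host to hand.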

Oracle XE 10g

This was more of a learning process, since usually you just install Oracle and leave it the hell alone. Not so for me.

  • Install Oracle on both nodes using (fortunately) the RPMs they provide
  • Configure Oracle on both nodes including creating the databases, using the same password for SYSDBA
  • Shut down both instances of Oracle
  • Create the DRBD resource, and mount it on the primary node
  • On the primary node, move the contents of /usr/lib/oracle/xe/oradata and /usr/lib/oracle/xe/app/oracle/flash_recovery_area onto the mounted DRBD
  • On the secondary node, delete the aforementioned paths
  • Bind mount the oradata and flash recovery area from the mounted DRBD volume into the correct places in the directory tree.
  • Start Oracle
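The directory shuffling in the steps above can be sketched as a pair of functions that only build the commands, so you can review them before running anything. The two Oracle paths come from the list; the DRBD mount point you pass in, the function names and the “mkdir -p” that recreates empty mount points are assumptions for illustration.

```python
import posixpath

# The two directories that have to live on the replicated volume.
ORACLE_DIRS = [
    "/usr/lib/oracle/xe/oradata",
    "/usr/lib/oracle/xe/app/oracle/flash_recovery_area",
]

def _target(drbd_mount, src):
    """Where a directory ends up on the DRBD filesystem."""
    return posixpath.join(drbd_mount, posixpath.basename(src))

def relocation_commands(drbd_mount, primary):
    """One-off preparation, to be run with Oracle stopped on both nodes."""
    commands = []
    for src in ORACLE_DIRS:
        if primary:
            # The primary moves its data onto the mounted DRBD volume.
            commands.append(["mv", src, _target(drbd_mount, src)])
        else:
            # The secondary just discards its local copy.
            commands.append(["rm", "-rf", src])
        # Either way, leave an empty directory behind as the bind-mount point.
        commands.append(["mkdir", "-p", src])
    return commands

def bind_mount_commands(drbd_mount):
    """Bind mounts for whichever node currently has the DRBD volume mounted."""
    return [["mount", "--bind", _target(drbd_mount, src), src]
            for src in ORACLE_DIRS]
```

In the HA setup the bind mounts are performed by Heartbeat on failover rather than by hand, which is why they are kept separate from the one-off relocation.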

After I had created a Heartbeat resource group which contained the DRBD resource, the DRBD filesystem mount, the aforementioned bind mounts and the Oracle service itself, I was quite pleased to see that Oracle plays quite nicely with our shoehorned HA setup. You’ll want to make sure you have a properly fixed Oracle init script though, as the supplied one is fairly bad.

After making Oracle and Windows 2008 work nicely in HA, I’m almost certain any service, no matter how bad, can be shoehorned in a similar way to give you decent availability even when that wasn’t originally intended.

This just in, from the Department of the Bleedin’ Obvious

Tuesday, September 8th, 2009

I kid you not, we just received this in a piece of marketing guff from our favourite enterprise vendor.

“Industry analysts predict that Linux and Windows will soon dominate the operating system space. How you respond to this is critical.”

Meanwhile, industry analysts predict that more than 98% of the population will be consuming oxygen by 2010.

A great Windows FTP & SFTP Client

Wednesday, April 22nd, 2009

A question I get asked reasonably often is “Do you know any good free FTP programs?” Yes, I do. It is WinSCP.

Some of the cool features are:

  • It does what it is designed to do and does it excellently.
  • SFTP, SCP & FTP support (ditch FTP and use SFTP!)
  • I’ve never seen it crash.
  • Transfer resuming on broken and cancelled downloads.
  • Supports SSH keys, so you do not need to remember another password.
  • Scripting support; schedule your own remote backups or have sane website rollout procedures!

The WinSCP site describes it as “WinSCP is an open source SFTP client and FTP client for Windows. Its main function is the secure file transfer between a local and a remote computer. Beyond this, WinSCP offers basic file manager functionality. It uses Secure Shell (SSH) and supports, in addition to Secure FTP, also legacy SCP protocol.”

You can download it from here and the obligatory screen shots can be found here.

All of Anchor’s shared hosting plans support SSH & SFTP connections. If you want to read more about how to use SSH, we have some wiki articles that we prepared earlier. They were written for dedicated & VPS servers, but they are still relevant.

If you manage an important site, rollout scripts can really make your web site updates pain free. I encourage anyone not using rollout scripts to have a look at the scripting capabilities of WinSCP.
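To give a flavour of what a rollout script looks like, here is a small sketch that generates a WinSCP script file. The user, host and directory arguments are placeholders, and the function name is made up; the script commands themselves (`option batch abort`, `option confirm off`, `open`, `synchronize remote`, `close`, `exit`) are standard WinSCP scripting commands.

```python
def winscp_rollout_script(user, host, local_dir, remote_dir):
    """Build the text of a WinSCP script that pushes local_dir to remote_dir."""
    lines = [
        # Abort on the first error instead of waiting for someone to answer a prompt.
        "option batch abort",
        # Don't ask for confirmation before overwriting files.
        "option confirm off",
        "open sftp://{0}@{1}/".format(user, host),
        # Make the remote directory match the local one.
        'synchronize remote "{0}" "{1}"'.format(local_dir, remote_dir),
        "close",
        "exit",
    ]
    return "\n".join(lines) + "\n"
```

Save the output as, say, rollout.txt and run it with `winscp.com /script=rollout.txt`. This assumes the server’s host key has already been accepted and that you authenticate with an SSH key, since a batch-mode script can’t stop to ask questions.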

Filebucketing to the MAXXXXX

Thursday, March 12th, 2009

Every now and then we see an example of application failure so astounding it literally brings tears to our eyes. We have a client whose legacy application is unfortunately still running on an ancient version of Oracle Weblogic and which must be maintained until the new, flashy .NET version of their site is complete.

We were alerted this morning to a problem with some of the Weblogic content – the pages were timing out. Diagnostics were fairly fruitless – packet captures showed nothing useful, and the logging from Weblogic left much to be desired. We started considering more outlandish possibilities such as I/O load causing issues, recently applied updates and so on. Even rebooting was considered (given it is running on Windows).

The first clue of note was the open file list from the Weblogic processes – one such example stood out:

C:\weblogic\state\Sa0V\b1gR\O1Ok\WqYN\9kiv\IQT2\SHGx\C3ri\aE1z\L1YH\X5QW\
gdkB\B2PB\pPPw\uHDK\p1a7\I0l5\94sU\kQ43\+533\5517\5738\7484\6253\_-10\
6273\1519\_6_8\888_\8888\_700\2_702_8\888_

For the sake of your screen, I have manually wrapped this Godzilla-like filename.



Perhaps you are familiar with file bucketing already, but if not, typically the directory structure used will have a relatively sane scheme for locating files and only extend a few levels deep. What we saw in this instance was a completely new breed of monster. Admittedly the absolute path of this file is less than 200 characters out of a limit of more than 32,000 but the naming strategy and depth of the structure has us flummoxed.
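For comparison, a “relatively sane” bucketing scheme might look something like the sketch below: hash the filename and use the first few characters of the digest as directory names. The function name, the choice of MD5 and the two-levels-of-two-characters layout are illustrative assumptions, not anything taken from Weblogic.

```python
import hashlib
import posixpath

def bucket_path(root, filename, levels=2, width=2):
    """Place filename under root in a shallow, evenly spread directory tree."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # Take successive slices of the digest as nested bucket names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return posixpath.join(root, *(parts + [filename]))
```

With the defaults there are at most 256 × 256 = 65,536 leaf buckets, each two levels below the root, which is plenty for millions of files without any single directory growing unmanageable.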

But this was only the tip of the proverbial iceberg. When we asked Windows to show us the properties of this state folder, it took over an hour to calculate the file and folder totals, and the result is impressive:

Web logic makes efficient use of the filesystem

Yes, you read that right – over 10 million nested directories. By this stage we had already moved the state directory out of the way and created a new one, and restarted Weblogic. It seemed happy and quite responsive after that. My suspicion is that someone developing this application at some point ran into a limitation with their filebucketing algorithm, and resolved to solve the problem once and for all, evidently by making it possible to efficiently filebucket every file in the known universe.
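If you’d rather not wait on the Properties dialog, the same totals can be gathered with a short script. This is a minimal sketch, not what we ran at the time; the function name is made up, and on a tree of this size it will still take a good while.

```python
import os

def count_directories(root):
    """Return (number of directories below root, depth of the deepest one)."""
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    total = 0
    deepest = 0
    for current, dirnames, _filenames in os.walk(root):
        # Every entry in dirnames is one more directory somewhere below root.
        total += len(dirnames)
        # Depth of the directory being visited, with root itself at depth 0.
        depth = current.rstrip(os.sep).count(os.sep) - base_depth
        deepest = max(deepest, depth)
    return total, deepest
```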

Patch Tuesday again

Monday, December 15th, 2008

If you’re one of our dedicated server customers, you’ve got the option of a paid support package, the choices being Anchor Secure and Anchor Complete. Whatever you choose (or if you decide you don’t need one), we just hope it’s the right one for you.

One of the services we provide with a support package is keeping your system up to date. For Linux machines this means installing updated packages as they’re released, and for Windows this means staying on top of Windows Update. We can do a lot of this without you ever noticing, but Windows Updates almost always require a reboot of the machine, which we schedule with our customers by email.

This brings us to an amusing little snippet from one of our customers.

Anchor: We’re going to reboot your server next Wednesday at about 11pm, please tell us if that’ll cause any problems.

Customer: Ah, I was not aware of Microsoft’s update schedule.

Great Lord almighty, there are undiscovered tribes in the Amazon that know that Microsoft releases patches on the second Tuesday of the month!
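For anyone who hasn’t memorised the schedule, “the second Tuesday of the month” is easy enough to compute. A small sketch (the function name is ours):

```python
import datetime

def second_tuesday(year, month):
    """Return the date of the second Tuesday of the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    days_to_first_tuesday = (1 - first.weekday()) % 7
    return first + datetime.timedelta(days=days_to_first_tuesday + 7)
```

For December 2008 this gives the 9th.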

Bug report: “all” does not mean all, for some values of “all”

Tuesday, November 18th, 2008

We’ve discovered some interesting things about Windows, and they never fail to cause some head-scratching. We had cause to go rooting through a customer’s WordPress installation recently to hunt down the cause of PHP errors, and discovered two WTFs here.

The first was the breakage of various scripts in the wp-admin directory. Through means unknown, every array definition was broken by the addition of a file path. If you grok PHP, you’ll recognise that this isn’t syntactically valid:

$defaults = array(
'show_option_all'../../../wordpress/wp-includes/ => '',
'show_option_none'../../../wordpress/wp-includes/ => ''
);

Python is our preferred in-house language, but breadth of knowledge is more important for a sysadmin. Cleaning up the PHP was a snap, but it’s a mystery as to how this happened in the first place; according to the customer it “just stopped working”. It looks a bit like someone got busy with a site-wide find-and-replace. This isn’t implausible, but it seems far less likely given that this is on a Windows machine.
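One way to do that sort of cleanup is with a regular expression that removes anything wedged between a quoted array key and its “=>”. The sketch below is an illustration rather than the exact fix we applied; the pattern and function name are assumptions, and it is meant to be run one line at a time.

```python
import re

# A single-quoted PHP string, then stray non-space junk, then the "=>" operator.
STRAY_PATH = re.compile(r"""('(?:[^'\\]|\\.)*')[^\s=]+(\s*=>)""")

def strip_stray_paths(line):
    """Remove text wedged between a quoted array key and its '=>'."""
    return STRAY_PATH.sub(r"\1\2", line)
```

Run over the first broken line above, this turns it back into `'show_option_all' => '',` and leaves already-valid lines untouched.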

(more…)

A tale of two drives

Thursday, October 9th, 2008

It’s no secret that we’d rather be working on Linux than Windows here at Anchor. Windows is, by and large, much more annoying to actually get anything done on, and it also just breaks in opaque and unexplained ways. O Windowes, let me count the ways in which you are broken! This is one such problem we ran into yesterday.

Hard drive failure is a fact of life when you run servers, by sheer virtue of the fact that you have hundreds of them. To mitigate the risk and reduce unscheduled downtime, we use Windows’ built-in software RAID feature. It’s not an enterprise solution, but it gets the job done. What’s important is staying online and not losing data.

Did I mention that trying to monitor a Windows box is a nightmare? A colleague of mine wrote a script to allow us to keep a watchful eye on Windows RAID volumes, and it’s a lifesaver. A recently-deployed machine got a broken mirror, which we were able to act on immediately. We removed the dodgy mirror and prepared a replacement (we always have plenty of spares, of course). Allow me now to re-enact this scene…

Windows (sounding almost efficient): The driver has detected that device \Device\Harddisk1\DR9 has predicted that it will fail

Sysadmin: Thanks, Windows, I’ll get right on that. You didn’t say whether that was SMART, or just voodoo, but whatever, it’s good to know.

The bad drive is removed and a replacement installed in the hotswap drive bay

Sysadmin: Okay, Windows, do your stuff. “Scan for new hardware”, please.

A pause.

Sysadmin: Ahem, Windows, “Scan for new hardware” and find my drive.

Windows: ‘Ey there, chaps. Do what now, you say? AIEEEEGRH!!

The server stops responding entirely, necessitating a touch of the reset button

Needless to say, we’re rather unimpressed, and have to call the customer to let them know why it’s just dropped offline.

A quick check of the logs is in order. It’s also frustrating that there’s no sane way to scroll through log entries in Windows with something like a text editor, or to “tail” a log as it’s updated in realtime.

09:36 – The previous system shutdown at 9:21:23 AM on 8/10/2008 was unexpected.

Okay, it went down at 09:21, which is correct. Now if we look back in time a little…

09:21 – dmio: Harddisk1 write error at block 1953524618 due to disk removal

*sigh* And this is after the disk was removed cleanly…
