Monday 14 October 2013

WYCRMS Part 5. Nobody Ever Runs a Server That Long

In 1997, an HP 9000 engineer wouldn't blink at telling me about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have since moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

Uptime

IT engineers can take this stuff really seriously. I was quite proud of my own server at home that ran, uninterrupted, for one year and three months, during which time I upgraded the RAID array without any filesystem disruption, hosted at least twenty different types of VM (from Windows 3.11 to Windows 2008), served media files and MySQL databases, shared photos with Apache and personal files with Samba. Only the fast-moving BTRFS in-kernel driver broke me out of that little obsession, but you don't need to run a Unix variant to get that kind of availability.

Windows admins are simply used to "bouncing" computers to correct a fault. Hey, it works at home, right? It's complacency, a quick fix, and often a response to an urgent need to restore service - troubleshooting can and often does take a back seat when management is standing over your shoulder.

Ever since Windows was a command launched at a DOS prompt, the restart has been a panacea - and, unfortunately, often one that actually works. It almost never adds any insight into the fault's cause. Perhaps there's a locked file you're not aware of preventing a service from restarting, or a routing entry updated somewhere in the past that isn't reapplied when the system starts up again; there are myriad ways that a freshly started server can have a different configuration from the one you shut down, allowing service to resume.
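To illustrate: before concluding the OS is simply stuck and reaching for the power button, it's usually possible to find out exactly who is holding that locked file. Here's a minimal sketch in Python, assuming the psutil package is installed and using a hypothetical lock-file path:

    # Find which process is holding a file open, rather than rebooting blind.
    import psutil

    TARGET = r"C:\MyService\data\journal.lock"  # hypothetical locked file

    for proc in psutil.process_iter(["pid", "name"]):
        try:
            for f in proc.open_files():
                if f.path.lower() == TARGET.lower():
                    print(f"{proc.info['name']} (PID {proc.info['pid']}) holds {f.path}")
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue  # some system processes refuse inspection; skip them

Five minutes with output like that tells you which service to fix; a reboot tells you nothing.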

Once a server is built, it is configured with applications and data, then generally tested (implicitly or explicitly) for fitness of purpose. After that, it processes its workload and needs no further attention, assuming routine tasks such as backups and defragmentation are scheduled and succeed. Windows isn't exactly a finite-state machine (in the practical sense), but it is nonetheless a closed system that can only perform a limited set of tasks, and its failure modes should be easy to predict.

Servers are passive things. They serve, and only perform actions when commanded to. Insert the OS installation DVD, run a standard installation, plug in a network cable, and the system is ready for action. Only it's not configured to take action just yet - it's waiting. In this state, I think most engineers would expect it to keep waiting for quite some time - perhaps a year or more. But add a workload - say, a database instance or a website - and attitudes change.

I've had frequent discussions with engineers who will tell me things like "this server has been up for over a year, it's time to reboot it". Somewhere between an empty server and a billion-row OLTP webshop is a place where the server goes from idle and calm to something that genuinely scares engineers - just for running a certain amount of time.

When pressed for exactly which component is broken (or likely to break) that this mystic reboot is supposed to fix, I never get anything specific, just a vague "it's best practice".

Windows Updates are frequently cited as a reason to reboot servers, and thanks to the specifics of how Windows implements file locking, yes, the reboot there is unavoidable. This leads to an unfortunate tendency to accept reboots as a normal part of Windows Server operation - engineers come to see the reboot as the point (with an update thrown in since the server is rebooting anyway) instead of an unfortunate side-effect. I realise the need to keep servers patched, but again, when pressed for a description of which known defects - ones that have actually affected service, or plausibly could - a particular update, with its associated downtime, will fix, the response comes in: "Um, best practice?".
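For what it's worth, whether a reboot is genuinely pending is something Windows records, and it can be checked rather than assumed. A rough sketch in Python using only the standard library's winreg module - these two registry markers are well known, though this is far from an exhaustive check:

    # Check the well-known "reboot pending" markers instead of rebooting on a hunch.
    import winreg

    MARKERS = [
        # Files queued to be replaced at next boot (in-use file locking)
        (r"SYSTEM\CurrentControlSet\Control\Session Manager", "PendingFileRenameOperations"),
        # Windows Update's reboot-required flag (the key's existence is the signal)
        (r"SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired", None),
    ]

    def reboot_pending():
        for subkey, value in MARKERS:
            try:
                with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey) as key:
                    if value is None or winreg.QueryValueEx(key, value)[0]:
                        return True
            except OSError:
                continue  # marker absent: nothing pending here
        return False

    print("Reboot pending" if reboot_pending() else "No reboot pending")

If neither marker is set, that patch window didn't actually require a restart.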

In the absence of an actual security threat, known defect fix or imminent power failure, I am rarely convinced to shut a server down. I first included "vendor recommendation" in that list, but realised I've yet to see one. Ever.

Even at three in the morning when no sane customer could be relying on a system, during a once-quarterly change window when all services are declared unavailable so service providers can make radical changes - even then: No, you can't reboot my server.

If engineers took the time to think about where, in the continuum from empty server to complex beast, the point of fear arrives, they could figure out which bits scare them and make sure those are well understood, properly configured and maintained. Unfortunately, that takes time, effort and sometimes a bit of theory and modelling. Rebooting is so much less effort.
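As one example of that effort: the things that actually degrade on a long-running Windows box - leaking handles, creeping private memory in a specific process - can simply be watched. A minimal sketch, again assuming psutil is available and using a hypothetical service executable name:

    # Trend the counters that actually wear out, instead of rebooting by calendar.
    import time
    import psutil

    SERVICE_EXE = "myservice.exe"  # hypothetical process to watch

    def snapshot():
        for proc in psutil.process_iter(["pid", "name"]):
            if (proc.info["name"] or "").lower() != SERVICE_EXE:
                continue
            try:
                rss_mib = proc.memory_info().rss // (1024 * 1024)
                handles = proc.num_handles()  # Windows-only counter
                print(f"PID {proc.info['pid']}: {rss_mib} MiB, {handles} handles")
            except psutil.NoSuchProcess:
                pass  # process exited between listing and inspection

    while True:
        snapshot()
        time.sleep(3600)  # an hourly trend beats a monthly reboot

A flat trend over months is evidence the server can keep running; a rising one points at the component that needs fixing.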

Windows Server can, and should, be expected to remain ready for service for as long as the hardware lasts. With the advent of virtualisation and VMotion, even that obstacle is gone, and the limits are practically nowhere to be found. Applications are another story, and if the developer or support specialist thinks they need restarting, that's fine - but they have zero authority to suggest this for Windows.

I've heard the phrase "excessive uptime" identified as the root cause of outages. I doubt Microsoft would like to know that the engineers they certify are - and I don't say this lightly - genuinely afraid of their product doing its job, as designed, for years. And when that doesn't happen - when a genuine OS fault occurs that only a reboot can solve - it is quite shocking how few engineers will actually report the problem to the vendor, tolerating workarounds, design hacks and kludgy scripts instead.

In the same way that one can learn a procedure for changing the spark plugs on a specific model of engine while completely missing the black smoke pouring from the exhaust thanks to a chronic ignition-timing fault, so too can engineers who have not yet attained even a mediocre grasp of computing theory continue to diagnose and treat only symptoms.

A server failing at its core function of staying up is not a mild symptom that a reboot can fix. It is a fundamental failure of the product, and avoiding the hard work of actually understanding why - and demanding that the vendor improve their product - does nobody any service.

In fact, it's negligent.

Previous: Part 4. Windows Updates and File Locking
