Thursday 17 October 2013

WYCRMS Part 6. It's OK, the Resilient Partner Can Take Over

In 1997, an HP 9000 engineer wouldn't blink telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

6. It's OK, It's One of a Resilient Pair

No, it's not.

So I have two Active Directory Domain Controllers. They both run DHCP with non-overlapping scopes, they have a fully replicated database, and any client gets both servers in a DNS query, so load-balancing is pretty good. Only they're not a single service. They may be presented to users as one, but they are distinct servers running distinct instances of software that, apart from exchanging directory updates, don't cooperate. A client locates AD services using DNS, and simply rebooting a server leaves those DNS entries intact. You've now intentionally degraded your service (only half of your nodes are active) without giving your clients a heads-up.
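If you want to see this for yourself, here's a minimal sketch of the SRV lookup an AD client performs, using the third-party dnspython package (example.com stands in for your AD DNS domain; everything here is illustrative):

    # The _ldap._tcp.dc._msdcs.<domain> SRV query is how AD clients
    # locate Domain Controllers. Note that a DC mid-reboot is still
    # returned: nothing here knows, or cares, that it is down.
    import dns.resolver

    answers = dns.resolver.resolve("_ldap._tcp.dc._msdcs.example.com", "SRV")
    for rr in answers:
        print(f"{rr.target} port {rr.port} "
              f"priority {rr.priority} weight {rr.weight}")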

Sure, DNS timeouts will direct them to the surviving node eventually, but why would you intentionally degrade anything when it's avoidable? It's only thirty seconds, you say? Why is that tolerable to you?
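For a feel of what that thirty seconds looks like from the client's side, here's a rough sketch (hostnames hypothetical) of a naive client walking a stale server list with a per-host connect timeout:

    # A naive client trying each returned server in turn. With dc1
    # mid-reboot, the client burns the full timeout before it ever
    # tries dc2 - the degradation window you chose to create.
    import socket
    import time

    SERVERS = [("dc1.example.com", 389), ("dc2.example.com", 389)]
    TIMEOUT = 15  # seconds a client might wait per host

    start = time.monotonic()
    for host, port in SERVERS:
        try:
            with socket.create_connection((host, port), timeout=TIMEOUT):
                print(f"reached {host} after {time.monotonic() - start:.1f}s")
                break
        except OSError:
            continue  # next candidate, timeout already paid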

Failover cluster systems are also not exempt. One of the benefits of these clusters is that a workload can be moved to (proactively) or recovered on (after a failure) another node. Only failover clustering is shared-nothing, so the entire service has to be taken offline before it is started on the other node. Again this involves an outage, and as much as Microsoft have taken pains to make startup and shutdown of e.g. SQL Server much quicker than in the past, other vendors are likely not as quick. It's astonishing how quickly unknown systems come to rely on the one you thought was an island, and how badly they handle the interruption when it moves.
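Don't take my word for the outage: a one-second probe against the clustered service's client access point will show the gap between the service stopping on one node and answering on the other. A sketch, with hypothetical host and port:

    # Poll a clustered service's TCP port during a failover and print
    # the window in which nobody answers. Point it at your service's
    # client access point; the names below are placeholders.
    import socket
    import time

    HOST, PORT = "sqlcluster.example.com", 1433
    down_since = None

    while True:
        try:
            socket.create_connection((HOST, PORT), timeout=1).close()
            if down_since is not None:
                print(f"down for {time.monotonic() - down_since:.1f}s")
                down_since = None
        except OSError:
            if down_since is None:
                down_since = time.monotonic()
        time.sleep(1)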

Only in the case of an actively load-balanced cluster can taking one node down be said to be truly interruption-free. When clients request service, the list of nodes returned contains no stale entries. Where user sessions matter, the alternate node(s) can take over and continue to serve the client without a blip or a login prompt. In case you're confused: this is still no reason to shut down a server anyway (refer to the previous five articles if you're leaning that way), and if you're thinking of leaving a single node to keep churning through the load, then you haven't quite grasped what resilience is there for.
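The reason it can be interruption-free is that the balancer takes the node out of rotation before anything happens to it, so no client is ever handed a dead entry. A toy sketch of that drain-first ordering (purely illustrative; real load balancers do this for you):

    # Toy round-robin pool with a graceful drain: remove the node from
    # rotation first, then do your maintenance. Purely illustrative.
    class Pool:
        def __init__(self, nodes):
            self.active = list(nodes)
            self._i = 0

        def pick(self):
            node = self.active[self._i % len(self.active)]
            self._i += 1
            return node

        def drain(self, node):
            # Out of rotation *before* the node is touched.
            self.active.remove(node)

    pool = Pool(["app1", "app2"])
    pool.drain("app1")  # maintenance window for app1
    assert all(pool.pick() == "app2" for _ in range(4))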

The point of a resilient pair is that it is supposed to survive outages, not as a convenient tool for admins to perform disruptive tasks hoping users won't notice. There's a similar tendency for people to use DR capacity for testing, without considering whether the benefits of that testing are truly greater than the reduction or elimination of DR capacity.

Application presentation clusters (e.g. Citrix XenApp) are a favourite target for reboots, and are the most often-cited area where these reboots are "best practice". Here it is: the only vendor-published document for current software I have found in the last five years advocating a scheduled reboot, Citrix's XenDesktop and XenApp Best Practices Guide, page 59. It is poor, to say the least:

A rolling reboot schedule should be implemented for the XenApp Servers so that potential application memory leaks can be addressed and changes made to the provisioned XenApp servers can be reset. The period between reboots will vary according to the characteristics of the application set and the user base of each worker group. In general, a weekly reboot schedule provides a good starting point.

More imprecise advice is hard to find in technical documents. How exactly does the administrator, engineer or designer know the level of their "potential" exposure to memory leaks? I've spent some time exploring this issue in the previous articles, and I stand by my point: if an administrator tolerates poor behaviour by applications or - worse - the OS itself without making an attempt to actually correct the flaw (e.g. contacting the vendor to demand a quality product), that administrator is negligent, scheduled reboots are a workaround, and nobody can have a reasonable expectation of quality service from that platform.

But most of all: How are you ever going to trust a vendor who has so little faith in their product that it cannot tolerate simply operating? I'm not singling out Citrix here, but their complacency in the face of bad code is shocking. I admire Citrix, so I'm not pleased at this display of indifference. Best practice I guess...

Then we get to sentences two and three of this three-sentence paragraph, which tell our reboot-happy administrator to try a particular schedule without a definitive measure in sight. There's a link on how to set up a schedule and how to minimise interruption while it happens, but not one metric - or even a place to find one - is proposed. He/she is given a vague "meh, a week?" suggestion with zero justification, apart from being "feels-right"-ish.
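For contrast, here's the kind of measurement that paragraph could have asked for: sample the process's resident memory over time and look at the trend before committing to any schedule. A sketch using the third-party psutil package; the process name is a placeholder of my own, not Citrix guidance:

    # Sample a process's resident memory at intervals and fit a
    # least-squares slope. A steadily positive slope is evidence of a
    # leak worth taking to the vendor; flat noise is not.
    import time
    import psutil

    def sample_rss(name, samples=12, interval=300):
        proc = next(p for p in psutil.process_iter(["name"])
                    if p.info["name"] == name)
        points = []
        for i in range(samples):
            points.append((i * interval, proc.memory_info().rss))
            time.sleep(interval)
        return points

    def slope(points):
        # bytes per second, by ordinary least squares
        n = len(points)
        sx = sum(x for x, _ in points)
        sy = sum(y for _, y in points)
        sxx = sum(x * x for x, _ in points)
        sxy = sum(x * y for x, y in points)
        return (n * sxy - sx * sy) / (n * sxx - sx * sx)

    points = sample_rss("SomeLeakyApp.exe")  # hypothetical process name
    print(f"trend: {slope(points) * 3600:.0f} bytes/hour")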

If a server fails, it is for a specific reason. Sometimes that reason is buried deep in kernel code, the result of interactions that can never be meaningfully replicated, or something far more exotic. In most cases, however, the cause is precise and identifiable (memory leaks included), and computing is honestly not so hard that these faults cannot be fixed.

You can probably tell I'm an open-source advocate, because I firmly believe in reporting bugs. I also believe I get to see the response to those bugs. I've found some projects to be more responsive than others, but generally, if I've found something that is not just broken but damaging, I see people hopping to attention - and those are volunteers.

If you're buying your software from a vendor, that responsiveness is the floor they should start from in their response to you. Tolerate nothing less than attention, and get your evidence together before they start pointing fingers.

When you work in a large organisation you realise things have designations and labels for a reason. Resilient pairs are for unanticipated failures, and DR servers are for disasters.

You don't get to hijack a purpose just because it's unlikely it will be needed - they exist precisely for the unlikely.

Previous: WYCRMS Part 5. Nobody Ever Runs a Server That Long
