Tuesday, 5 November 2013

WYCRMS Part 7. I Don't Think You Understand What A Server Is

In 1997, an HP 9000 engineer wouldn't blink when telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

7. I Don't Think You Understand What A Server Is

It's taken me a long time to find the right words to explain this title, because it's a bit contentious. This is a longer article, but I hope it's of value and I hope it expresses my sense of pride in, and love of, computing.

I've explained in previous posts that Windows Server is accessible to those new to IT engineering because it is a simple learning path from the system most of us have in our homes, schools, libraries and elsewhere. Moving to depending on a server to provide service takes different people different amounts of time to get right, and some organisations are more tolerant of slip-ups than others.

There is something that starts to take over. It's not obvious, and since it applies to something commonplace and often the subject of passion for those true geeks, it is also often unrecognised in oneself.


In no other industry would a consumer or customer tolerate advice such as rebuilding a product, service or transaction because of a minor fault. No mechanic would expect to be paid if they told their customers that the brakes may fail after an hour on the freeway, advising the driver to just come to a complete stop and then move off again, yet this is what we in the (Windows) IT service industry resort to all too often. It's a mild paranoia that all things Microsoft (and others, as I will show) are prone to misbehave, and I've been told this explicitly, in writing, more than once by multiple providers after I've requested that a feature or function be deployed. Providing vendor documentation in favour of the product's capabilities doesn't sway these types, as they've experienced their own horror stories of staying up until 2am while functionality is restored. I know, I've been in that trench.

But imagine for a moment that you'd gone to a garage affiliated directly with your car's manufacturer. A Ford mechanic giving me the aforementioned brake-fault workaround would be held to account if he also displayed his certification of training or affiliation with Ford Motor. I could challenge him and go to his management to insist that real advice and a fix be provided for my product. If he could show me that his advice was sound, my beef would be with Ford for selling me an underperforming product. I'd have recourse.

IT engineers seem to think they don't, either in being the place to receive this criticism (hey, they didn't write the software, they just run it), or in being able to back up their fears with proof that vendor products in fact don't behave as they would hope. Yet the offices of IT service providers proudly display partner certifications, while engineers with IT credentials flowing off their resumes continue to fear the products that are their livelihood and the foundation for the logos they too show off with some pride. Doublethink indeed.

I am very active in reporting bugs in open-source and even paid-for products, because I expect that a product is only as good as those who help make it better. I've already mentioned that failing to even start reporting faults to vendors is negligent, and that engineers should have more pride in their platforms and confidently defend them from detractors with authoritative sources. What I haven't spoken about is the relationship those engineers have with the platform itself.

There is a widely held belief in ICT, over a decade old, that is demonstrably false: Ethernet autonegotiation is quirky, and potentially dangerous. There was a time this was true, but it has not been for at least five years, since 100Base-TX lost its leading position as the connectivity method for datacenters, servers and finally desktops. Implementing Fast Ethernet (as 100Mb Ethernet was known) needs some knowledge of standards. When the Institute of Electrical and Electronics Engineers (IEEE) published their 802.3u standard for Fast Ethernet, I recall an interview with one of the panel members who stated that it is technically possible to run 100Mbit over Stacheldraht (German for barbed wire) and you may have success. He made it clear, though, that your experience cannot be guaranteed as it is not 802.3u-compliant.

That's the crux: when a vendor states something as a standard, part of a reference architecture, or included in their documentation, they're making a promise.

The section of the new standard dealing with how the two nodes select the operating speed and duplex setting was, unfortunately, not precise enough and open to some interpretation. Cisco and a few other vendors chose one interpretation while everyone else chose another. The resulting duplex mismatch is notoriously hard to diagnose: it shows up only at moderate load, and a ping test over an idle cable will likely succeed. It's insidious, and resulted in the near-universal abandonment of Autonegotiation in implementations (especially datacenters and core networks).
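As a rough illustration of how easily a duplex mismatch is created, here is a minimal Python sketch of one widely documented way it happens. This is not the vendor-interpretation problem itself, but the related failure that the "just hard-code it" habit tends to produce: one end has autonegotiation switched off and is forced to full duplex, while the other end still negotiates. The negotiating end can sense the speed but hears no negotiation signalling, so it falls back to half duplex. The names Port, link_mode and duplex_mismatch are made up for this sketch, and the model covers only the 10 and 100 Mbit modes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

# Hypothetical model for illustration only; not taken from any vendor's code.
Mode = Tuple[int, str]  # (speed in Mbit/s, "full" or "half")

# Modes in priority order, best first.
PRIORITY = [(100, "full"), (100, "half"), (10, "full"), (10, "half")]
ALL_MODES: FrozenSet[Mode] = frozenset(PRIORITY)


@dataclass(frozen=True)
class Port:
    forced: Optional[Mode] = None            # set when autonegotiation is disabled
    advertised: FrozenSet[Mode] = ALL_MODES  # used only when negotiating


def link_mode(port: Port, peer: Port) -> Mode:
    """Return the mode `port` ends up using when cabled to `peer`."""
    if port.forced is not None:
        # A hard-coded port ignores whatever the other end wants.
        return port.forced
    if peer.forced is None:
        # Both ends negotiate: pick the best mode they have in common.
        common = port.advertised & peer.advertised
        for mode in PRIORITY:
            if mode in common:
                return mode
        raise ValueError("no common mode advertised")
    # The peer is silent about its abilities. Speed can still be sensed
    # from the signal, duplex cannot, so the port assumes half duplex.
    return (peer.forced[0], "half")


def duplex_mismatch(a: Port, b: Port) -> bool:
    """True when the two ends of one cable disagree about duplex."""
    return link_mode(a, b)[1] != link_mode(b, a)[1]


auto = Port()
forced_full = Port(forced=(100, "full"))

both_auto = duplex_mismatch(auto, auto)                  # both negotiate
one_forced = duplex_mismatch(forced_full, auto)          # only one end hard-coded
both_forced = duplex_mismatch(forced_full, forced_full)  # both hard-coded the same
```

In this model the link is consistent when both ends negotiate or both ends are forced identically. Only the mixed case ends up mismatched, which is why a half-applied "always hard-code it" policy is worse than either extreme.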

The problem is, Autonegotiation is not only working well in Gigabit Ethernet (over twisted-pair copper, or Cat5e/Cat6 cabling), it is mandatory. Even network professionals, burnt previously in the 90s and later with Fast Ethernet, advise against turning on a feature that is explicitly required for a truly standards-compliant implementation, with all the promises attached. A prime reason is that the applicable line is buried deep inside Clause 28 of the IEEE 802.3 standard, as amended for Gigabit. It's dry reading...

Gigabit Ethernet was a big jump forward that started to seriously tax memory buses and CPUs like no other iteration before, and includes a highly valuable feature known as a Pause Frame to stop transfers flooding receive buffers and being dropped. This facility is only used if the opposite end cooperates, and the only mechanism to advertise this is autonegotiation.
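To make the effect concrete, here is a small Python sketch of a receiver with a finite buffer being fed faster than it can drain. It is a toy model under stated assumptions: the function name simulate, the tick-based timing and all the numbers are invented for illustration, and a real Pause Frame asks the sender to stop for a specified time rather than toggling a flag each tick.

```python
def simulate(ticks, send_rate, drain_rate, buffer_size, pause_enabled):
    """Return (delivered, dropped) frame counts for one receiver.

    Hypothetical toy model: each tick the sender offers `send_rate` frames
    and the receiver processes up to `drain_rate` frames from its buffer.
    """
    queued = delivered = dropped = 0
    paused = False
    for _ in range(ticks):
        if not paused:
            # Frames that do not fit in the receive buffer are lost.
            accepted = min(send_rate, buffer_size - queued)
            queued += accepted
            dropped += send_rate - accepted
        # The receiver works through what it has buffered.
        drained = min(drain_rate, queued)
        queued -= drained
        delivered += drained
        if pause_enabled:
            # Ask the sender to hold off whenever the next burst
            # would not fit in the remaining buffer space.
            paused = queued + send_rate > buffer_size
    return delivered, dropped


# Same traffic, same receiver, with and without flow control.
without_pause = simulate(100, send_rate=10, drain_rate=6,
                         buffer_size=40, pause_enabled=False)
with_pause = simulate(100, send_rate=10, drain_rate=6,
                      buffer_size=40, pause_enabled=True)
```

In this model the receiver is the bottleneck either way, so both runs deliver the same number of frames. The difference is that without pausing the excess is silently dropped and left for higher layers to retransmit, while with pausing nothing is lost.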

I've seen an implementation of Microsoft Exchange 2010 brought to its knees for lack of Pause Frames, and it is again an insidious failure since packets are only dropped under load, and ping tests and even high-load throughput tests succeed. What caused this issue is clinging to old wisdom without knowing its cause, and then failing to keep up with developments. Not running with Autonegotiation means you aren't running a standards-compliant Gigabit Ethernet network, and all promises are void.

Not following vendor advice is a bad idea. If the vendor promises a feature that you feel is not ready for primetime then by all means hold off. But if I expect something to work that a vendor promises will work, I don't expect to be told war stories of how this breaks - especially when I last saw that issue, myself, over 12 years ago. It's old thinking, stuck in past fears, and it's stopping you from unleashing your platform's potential. Windows Server especially has become a solid, dependable and performant platform, yet doubts linger and fears cling to dark corners, an uneasiness that is sometimes not even apparent to those harbouring it.

I enjoy reading about the history of computing, and contemplating how modern computers implement both Harvard and von Neumann architectures depending on how closely you're looking. It's esoteric to speak of privilege rings or context switches, but knowing these things has been of immense help in rounding out my understanding of computing and gaining trust in the models deployed. But the biggest thing I would like to see engineers embrace is this:

The Turing Machine

It's a simplistic representation of any computer, from your old calculator wristwatch to supercomputing clusters: a processor reads instructions from a sequence, carries out those instructions on some data, and stores the result somewhere before moving to the next instruction. The next instruction may be a pointer to a different instruction, but all of computing boils down to this concept. There may be more than one processor, and there may be complex layouts of memory, but at its most basic every computer works this way, and building your model of a system's internals starts here.
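The loop described above fits in a few lines of Python. This is a minimal sketch, closer to a simple register machine than to a formal Turing machine (which reads and writes symbols on a tape). The three-instruction set, the run function and the memory names are all invented for illustration.

```python
def run(program, memory, max_steps=1000):
    """Execute `program` against a copy of `memory`; return the final memory.

    Hypothetical instruction set for illustration:
      ("add", dst, a, b)    memory[dst] = memory[a] + memory[b]
      ("dec", dst)          memory[dst] -= 1
      ("jnz", src, target)  jump to `target` if memory[src] is not zero
      ("halt",)             stop
    """
    memory = dict(memory)  # work on a copy so the initial state is preserved
    pc = 0                 # index of the next instruction to read
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, *args = program[pc]
        if op == "add":
            dst, a, b = args
            memory[dst] = memory[a] + memory[b]
            pc += 1
        elif op == "dec":
            (dst,) = args
            memory[dst] -= 1
            pc += 1
        elif op == "jnz":
            src, target = args
            pc = target if memory[src] != 0 else pc + 1
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown instruction: {op}")
    else:
        raise RuntimeError("step limit reached without halting")
    return memory


# Multiply 4 by 3 using repeated addition.
multiply = [
    ("add", "acc", "acc", "x"),  # 0: acc = acc + x
    ("dec", "n"),                # 1: n = n - 1
    ("jnz", "n", 0),             # 2: loop back while n is not zero
    ("halt",),                   # 3: done
]
start = {"acc": 0, "x": 4, "n": 3}
result = run(multiply, start)    # result["acc"] == 12
```

Running the same program from the same starting memory gives the same result every time; the only way to get a different answer is to start from a different state.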

It is deterministic, in that the state after an instruction is performed can be predicted from the initial state. In principle, all of computing conforms to this, and any unexpected behaviour simply means the initial state was not understood well enough. It is this mountain that engineers need to climb to truly excel at their profession, and I've met some expert climbers in my time. They have no fear of digging down to each root cause, and unearthing an even deeper root.

Rebooting is not the answer. It indicates a lack of knowledge about the cause of faults. It is a sign of an unwillingness to investigate further. Worst of all, it is a misunderstanding of what your server is and what it is meant to do, and the longer you allow that mentality to persist, the worse off you will be.

Old tales have value, but they are no substitute for knowledge and verifiable fact. If those facts contradict your experience, investigate, shout at vendors, check your implementation.

But most of all, be proud of your platform, because as obscure as it appears to be, it is genuinely not that hard if you are willing to do better.

Previous: Part 6. It's OK, the Resilient Partner Can Take Over
