Tuesday, 5 November 2013

WYCRMS Part 7. I Don't Think You Understand What A Server Is

In 1997, an HP 9000 engineer wouldn't blink telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

7. I Don't Think You Understand What A Server Is

It's taken me a long time to find the right words to explain this title, because it's a bit contentious. This is a longer article, but I hope it's of value and I hope it expresses my sense of pride in, and love of, computing.

I've explained in previous posts that Windows Server is accessible to those new to IT engineering because it is a simple learning path from the system most of us have in our homes, schools, libraries and elsewhere. Moving to depending on a server to provide service takes different people different amounts of time to get right, and some organisations are more tolerant of slip-ups than others.

There is something that starts to take over. It's not obvious, and since it applies to something commonplace and often the subject of passion for those true geeks, it is also often unrecognised in oneself.


In no other industry would a consumer or customer tolerate advice such as rebuilding a product, service or transaction because of a minor fault. No mechanic would expect to be paid if they told their customers that the brakes may fail after an hour on the freeway, advising the driver to just come to a complete stop and then move off again, yet this is what we in the (Windows) IT Service industry resort to all too often. It's a mild paranoia that all things Microsoft (and others, as I will show) are prone to misbehave, and I've been told this explicitly, in writing, more than once from multiple providers after I've requested a feature or function be deployed. Providing vendor documentation in favour of the product's capabilities doesn't sway these types, as they've experienced their own horror stories of staying up until 2am while functionality is restored. I know, I've been in that trench.

But imagine for a moment if you'd gone to a garage affiliated directly with your car's manufacturer. A Ford mechanic providing me the aforementioned brake-fault workaround would be held to account if he also displayed his certification of training or affiliation with Ford Motor. I could challenge him and go to his management to insist that real advice and a fix be provided for my product. If he could show me that his advice was sound, my beef would be with Ford for selling me an underperforming product. I'd have recourse.

IT engineers seem to think they don't, either in being the place to receive this criticism (hey, they didn't write the software, they just run it), or in being able to back up their fears with proof that vendor products in fact don't behave as they would hope. Yet the offices of IT service providers proudly display partner certifications, while engineers with IT credentials flowing off their resumes continue to fear the products that are their livelihood and the foundation for the logos they too show off with some pride. Doublethink indeed.

I am very active in reporting bugs in open-source and even paid-for products, because I expect that a product is only as good as those who help make it better. I've already mentioned that failure to even start reporting faults to vendors is negligent, and how engineers should have more pride in their platforms and confidently defend them from detractors with authoritative sources. What I haven't spoken about is the relationship those engineers have with the platform itself.

There is a widely held belief in ICT, over a decade old, that is demonstrably false: Ethernet autonegotiation is quirky, and potentially dangerous. There was a time this was true, but it has not been for at least five years, since 100Base-TX lost its leading position as the connectivity method in the datacenter, then on servers and finally on desktops. Implementing Fast Ethernet (as 100Mb Ethernet was known) needs some knowledge of standards. When the Institute of Electrical and Electronics Engineers (IEEE) published their 802.3u standard for Fast Ethernet, I recall an interview with one of the panel members who stated that it is technically possible to run 100Mbit over Stacheldraht (German for barbed wire) and you may have success. He made it clear though that your experience cannot be guaranteed as it is not 802.3u-compliant.

That's the crux: when a vendor states something as a standard, part of a reference architecture, or included in their documentation, they're making a promise.

The section of the new standard dealing with how the two nodes selected the operating speed and duplex setting was, unfortunately, not precise enough and open to some interpretation. Cisco and a few other vendors chose one interpretation while everyone else chose another, and the resulting duplex mismatch is notoriously hard to diagnose: it shows up only at moderate load, while a ping test over an idle cable will likely succeed. It's insidious, and resulted in the universal abandonment of Autonegotiation in implementations (especially datacenters and core networks).

The problem is, Autonegotiation is not only working well in Gigabit Ethernet (over twisted-pair copper, or Cat5e/Cat6 cabling), it is mandatory. Even network professionals, burnt previously in the 90s and later with Fast Ethernet, advise against turning on a feature that is explicitly required for a truly standards-compliant implementation, with all the promises attached. A prime reason is that the applicable line is buried deep inside section 28 of the IEEE 802.3 standard, as amended for Gigabit. It's dry reading...

Gigabit Ethernet was a big jump forward that started to seriously tax memory buses and CPUs like no other iteration before, and includes a highly valuable feature known as a Pause Frame to stop transfers flooding receive buffers and being dropped. This facility is only used if the opposite end cooperates, and the only mechanism to advertise this is autonegotiation.
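To make the buffer argument concrete, here is a toy model of a sender and a receiver. It is only a sketch: the function name, the tick-based timing and all the numbers are my own assumptions for illustration, and it does not implement the real 802.3 flow control protocol. It simply shows that a receiver which can ask the sender to hold back loses nothing, while one that cannot starts dropping frames once its buffer fills.

```python
def simulate(ticks, send_rate, drain_rate, buffer_size, pause_enabled):
    """Toy model: frames arrive each tick, the receiver drains some each tick.

    With pause_enabled the sender holds back whatever does not fit in the
    receive buffer; without it, the overflow is simply dropped.
    All names and units here are illustrative assumptions.
    """
    queued = 0      # frames sitting in the receive buffer
    dropped = 0     # frames lost because the buffer was full
    delivered = 0   # frames the receiver has processed
    backlog = 0     # frames the sender is holding back while paused

    for _ in range(ticks):
        offered = send_rate + backlog
        room = buffer_size - queued
        accepted = min(offered, room)
        if pause_enabled:
            backlog = offered - accepted   # sender waits, nothing is lost
        else:
            dropped += offered - accepted  # overflow is discarded
            backlog = 0
        queued += accepted

        drained = min(queued, drain_rate)
        queued -= drained
        delivered += drained

    return {"delivered": delivered, "dropped": dropped, "held_by_sender": backlog}


if __name__ == "__main__":
    # Light load: the receiver keeps up, so the missing pause goes unnoticed.
    print(simulate(10, send_rate=5, drain_rate=6, buffer_size=20, pause_enabled=False))
    # Heavy load without pause: frames start dropping once the buffer fills.
    print(simulate(10, send_rate=10, drain_rate=6, buffer_size=20, pause_enabled=False))
    # Heavy load with pause: the sender holds frames back instead.
    print(simulate(10, send_rate=10, drain_rate=6, buffer_size=20, pause_enabled=True))
```

The light-load case is the interesting one: it behaves identically with or without pause, which is why a quick test over a quiet link tells you nothing about this fault.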

I've seen an implementation of Microsoft Exchange 2010 come to its knees for lack of Pause Frames, and it is again an insidious failure since packets are only dropped under load, and ping tests and even high-load throughput tests succeed. It is the clinging to an old wisdom without knowing the cause, and then failing to keep up with developments that has caused this issue. Not running with Autonegotiation means you aren't running a standards-compliant Gigabit Ethernet network, and all promises are void.

Not following vendor advice is a bad idea. If the vendor promises a feature that you feel is not ready for primetime then by all means hold off. But if I expect something to work that a vendor promises will work, I don't expect to be told war stories of how this breaks - especially when I last saw that issue, myself, over 12 years ago. It's old thinking, stuck in past fears, and it's stopping you from unleashing your platform's potential. Windows Server especially has become a solid, dependable and performant platform, yet doubts linger and fears cling to dark corners, an uneasiness that is sometimes not even apparent to those harbouring it.

I enjoy reading about the history of computing, and contemplating how modern computers implement both Harvard and von Neumann architectures depending on how closely you're looking. It's esoteric to speak of privilege rings or context switches, but knowing these things has been of immense help to round out my understanding of computing and gain trust in the models deployed. But the biggest thing I would like to see engineers embrace is this:

The Turing Machine

It's a simplistic representation of any computer, from your old calculator wristwatch to supercomputing clusters: A processor reads instructions from a sequence, implements those instructions with some data, and stores the result somewhere before moving to the next instruction. The next instruction may be a pointer to a different instruction, but all of computing boils down to this concept. There may be more than one processor, and there may be complex layouts of memory, but at its most basic form every computer works this way, and building your model of a system's internals starts here.

It is deterministic, in that the state after an instruction is performed can be predicted from the initial state. In principle, all of computing conforms to this model, and any unexpected behaviour simply means the initial state was not understood well enough. It is this mountain that engineers need to climb to truly excel at their profession, and I've met some expert climbers in my time. They have no fear of digging down to each root cause, and unearthing an even deeper root.
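A few lines of code can illustrate the point about determinism. What follows is a toy instruction interpreter, not a formal Turing machine; the three instruction names and the sample program are invented for this sketch. The same program started from the same state always ends in the same state, and a different result only ever comes from a different starting state.

```python
def run(program, accumulator=0, max_steps=1000):
    """Execute a tiny instruction sequence and return the final accumulator.

    The instruction set ("add", "double", "jump_if_below") is hypothetical
    and exists only for this illustration.
    """
    pc = 0  # program counter: index of the next instruction to execute
    for _ in range(max_steps):
        if pc >= len(program):
            break  # ran off the end of the program: halt
        op, *args = program[pc]
        if op == "add":
            accumulator += args[0]
            pc += 1
        elif op == "double":
            accumulator *= 2
            pc += 1
        elif op == "jump_if_below":
            limit, target = args
            # The next instruction may be a pointer to a different instruction.
            pc = target if accumulator < limit else pc + 1
        else:
            raise ValueError(f"unknown instruction: {op}")
    return accumulator


# Add 3, double, and repeat until the accumulator reaches at least 100.
PROGRAM = [("add", 3), ("double",), ("jump_if_below", 100, 0)]

if __name__ == "__main__":
    print(run(PROGRAM))      # same initial state, same result, every time
    print(run(PROGRAM))
    print(run(PROGRAM, 1))   # a different initial state explains a different result
```

If this program ever surprised you, the explanation would be in the starting accumulator or in the program itself, never in how long the interpreter had been running.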

Rebooting is not the answer. It indicates a lack of knowledge of the cause of faults. It is a sign of an unwillingness to investigate further. Worst of all, it is a misunderstanding of what your server is and what it is meant to do, and the longer you allow that mentality to persist the worse off you will be.

Old tales have value, but they are no substitute for knowledge and verifiable fact. If those facts contradict your experience, investigate, shout at vendors, check your implementation.

But most of all, be proud of your platform, because as obscure as it appears to be, it is genuinely not that hard if you are willing to do better.

Previous: Part 6. It's OK, the Resilient Partner Can Take Over

Thursday, 17 October 2013

WYCRMS Part 6. It's OK, the Resilient Partner Can Take Over

In 1997, an HP 9000 engineer wouldn't blink telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

6. It's OK, It's One of a Resilient Pair

No, it's not.

So I have two Active Directory Domain Controllers. They both run DHCP with non-overlapping scopes, have a fully replicated database and any client gets both servers in a DNS query so load-balancing is pretty good. Only they're not a single service. They may be presented to users as such but they are distinct servers, running distinct instances of software that, apart from sending each other directory updates themselves, don't cooperate. A user locates AD services using DNS, and simply rebooting a server leaves those DNS entries intact. You've now intentionally degraded your service (only half of your nodes are active), without giving your clients a heads-up.

Sure, DNS timeouts will direct them to the surviving node eventually but why would you intentionally degrade anything when it's avoidable? It's only thirty seconds you say? Why is this tolerable to you?
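A quick back-of-envelope model shows what that costs. This is a sketch under stated assumptions: the node names dc1 and dc2 are made up, the thirty-second timeout is the figure quoted above, the connect cost is a placeholder, and a real client's retry behaviour will differ. The client is handed both addresses in some order and only moves on to the next after the first one times out.

```python
from itertools import permutations


def connect_time(order, up, timeout=30.0, connect_cost=0.05):
    """Seconds until the client reaches a live node, trying nodes in order.

    Returns None if no node in the list is reachable. The timeout and
    connect_cost values are illustrative assumptions.
    """
    elapsed = 0.0
    for node in order:
        if up[node]:
            return elapsed + connect_cost
        elapsed += timeout  # wait for the dead node to time out
    return None


def average_connect_time(up, **kwargs):
    """Average over every order in which DNS might hand out the nodes."""
    times = [connect_time(order, up, **kwargs) for order in permutations(up)]
    return sum(times) / len(times)


if __name__ == "__main__":
    print(average_connect_time({"dc1": True, "dc2": True}))   # both nodes healthy
    print(average_connect_time({"dc1": False, "dc2": True}))  # dc1 rebooting, DNS unchanged
```

With one node down and DNS still advertising it, half of all clients in this model eat the full timeout before they get service, and nobody warned them.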

Failover cluster systems are also not exempt. One of the benefits of these clusters is that a workload can be moved to (proactively) or recovered on (after a failure) another node. Only failover clustering is shared-nothing, so an entire service has to be taken offline before it is started on the other node. Again this involves an outage, and as much as Microsoft have taken pains to make startup and shutdown of e.g. SQL Server much quicker than in the past, other vendors are likely not as forgiving. It's astonishing how quickly unknown systems can come to rely on the one you thought was an island, but suddenly don't know how to handle interruptions.

Only in the case of an actively load-balanced cluster can taking one node down be said to be truly interruption-free. When clients request service, the list of nodes returned does not contain stale entries. When user sessions are important, the alternate node(s) can take over and continue to serve the client without a blip or login prompt. In case you're confused, this is still no reason to shut down a server anyway; refer to the previous five articles if you're leaning that way. And if you're thinking of leaving a single node to keep churning through the load, then you haven't quite grasped what resilience is there for.

The point of a resilient pair is that it is supposed to survive outages, not as a convenient tool for admins to perform disruptive tasks hoping users won't notice. There's a similar tendency for people to use DR capacity for testing, without considering whether the benefits of that testing are truly greater than the reduction or elimination of DR capacity.

Application presentation clusters (e.g. Citrix XenApp) are a favourite target for reboots, and are the most often-cited area where these reboots are best-practice. Here it is: The only vendor-published document for current software that I have found in the last five years advocating a scheduled reboot. Citrix' XenDesktop and XenApp Best Practices Guide, page 59. It is poor to say the least:

A rolling reboot schedule should be implemented for the XenApp Servers so that potential application memory leaks can be addressed and changes made to the provisioned XenApp servers can be reset. The period between reboots will vary according to the characteristics of the application set and the user base of each worker group. In general, a weekly reboot schedule provides a good starting point.

More imprecise advice is hard to find in technical documents. How exactly does the administrator, engineer or designer know the level of his "potential" exposure to memory leaks? I've spent some time exploring this issue in the previous articles, and I stand by my point - if an administrator tolerates poor behaviour by applications or - worse - the OS itself without making an attempt to actually correct the flaw (e.g. contacting the vendor to demand a quality product), that administrator is negligent, scheduled reboots are a workaround, and nobody can have a reasonable expectation of quality service from that platform.

But most of all: How are you ever going to trust a vendor who has so little faith in their product that it cannot tolerate simply operating? I'm not singling out Citrix here, but their complacency in the face of bad code is shocking. I admire Citrix, so I'm not pleased at this display of indifference. Best practice I guess...

Then we get to sentences two and three of this three-sentence paragraph, which tell our reboot-happy administrator to try a particular schedule without a definitive measure in sight. There's a link on how to set up a schedule and how to minimise interruption while it happens, but not one metric or even a place to find them is proposed. He/she is given a vague "meh, a week?" suggestion with zero justification, apart from being "feels-right"-ish.
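For contrast, here is the kind of measure I would expect such guidance to at least hint at. It is a minimal sketch, not anything Citrix publishes: the function names, the sample readings and the 4096 MB limit are all hypothetical. Take periodic memory readings for a process, fit a straight line through them, and see how fast it is actually growing and when it would hit a limit.

```python
def growth_per_hour(samples):
    """Least-squares slope of memory use over time.

    samples is a list of (hours_since_start, memory_mb) readings; where the
    readings come from (a performance counter, a log) is left open here.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    return numerator / denominator


def hours_until_limit(samples, limit_mb):
    """Hours until the fitted trend reaches limit_mb, or None if not growing."""
    slope = growth_per_hour(samples)
    if slope <= 0:
        return None  # no upward trend, so no leak to schedule around
    _, latest_mb = samples[-1]
    return (limit_mb - latest_mb) / slope


if __name__ == "__main__":
    # Hypothetical daily readings for one process, in MB.
    readings = [(0, 500), (24, 548), (48, 596), (72, 644)]
    print(growth_per_hour(readings))          # MB gained per hour
    print(hours_until_limit(readings, 4096))  # hours before the assumed limit
```

A number like that is evidence: it tells you whether there is a leak at all, how urgent it is, and it is exactly what you attach to the bug report you send the vendor.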

If a server fails, it is for a specific reason. Sometimes this is buried deep in kernel code, the result of interactions that can never be meaningfully replicated, or much more exotic reasons. In most cases however it is because of a precise reason (memory leaks included), and computing is honestly not so hard that these cannot be fixed.

You might tell I'm an open-source advocate, because I firmly believe in reporting bugs. I also believe I get to see the response to that bug. I've found some projects to be more responsive than others, but generally if I've found something that is not just broken but damaging I see people hopping to attention - and that's people volunteering.

If you're buying your software from a vendor they have that floor to start from in their response to you. Tolerate nothing less than attention, and get your evidence together before they start pointing fingers.

When you work in a large organisation you realise things have designations and labels for a reason. Resilient pairs are for unanticipated failures, and DR servers are for disasters.

You don't get to hijack a purpose just because it's unlikely it will be needed - they exist precisely for the unlikely.

Previous: WYCRMS Part 5. Nobody Ever Runs a Server That Long

Monday, 14 October 2013

WYCRMS Part 5. Nobody Ever Runs a Server That Long

In 1997, an HP 9000 engineer wouldn't blink telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

5. Nobody Ever Runs a Server That Long


IT Engineers can take this stuff really seriously. I was quite proud of my own server at home that ran, uninterrupted, for one year and three months, during which time I upgraded the RAID array without any filesystem disruption, hosted at least twenty different types of VM (from Windows 3.11 to Windows 2008), served media files and MySQL databases, shared photos with Apache and personal files with Samba. Only the fast-moving BTRFS in-kernel driver broke me from that little obsession, but you don't need to run a Unix-variant to get that kind of availability.

Windows admins are simply used to "bouncing" computers to correct a fault. Hey, it works at home right? It's a complacency, a quick-fix, and often in response to an urgent need to restore service - troubleshooting can and often does take a back seat when management is standing over your shoulder.

Since Windows was a command launched at a DOS prompt, the restart has been a panacea, and often a very real one that unfortunately actually works. It almost always adds no insight into the fault's cause. Perhaps there's a locked file you're not aware of preventing a service from restarting, or a routing entry updated somewhere in the past that isn't reapplied when the system starts up again; there are myriad ways that a freshly started server can have a different configuration from the one you shut down, allowing service to resume.

Once a server is built, it is configured with applications and data, then generally tested (implicitly or explicitly) for fitness for purpose. Once that is done, it processes a workload and no more attention is required, assuming all tasks such as backups, defragmentation etc. are scheduled and succeed. Windows isn't exactly a finite-state machine (in the practical sense), but it is nonetheless a closed system that can only perform a limited set of tasks, and failure modes should be easy to predict.

Servers are passive things. They serve, and only perform actions when commanded to. Insert the OS installation DVD, run a standard installation, plug in a network cable, and the system is ready for action. Only it's not configured to take action just yet - it's waiting. In this state, I think most engineers would expect it to keep waiting for quite some time - perhaps a year or more. But add a workload - say a database instance or a website, and attitudes change.

I've had frequent discussions with engineers who will tell me things like "this server has been up for over a year, it's time to reboot it". Somewhere between an empty server and a billion-row OLTP webshop is a place where the server goes from idle and calm to something that genuinely scares engineers - just for running a certain amount of time.

When pressed for exactly which component is broken (or likely will be) that this mystic reboot is supposed to fix, I never get anything specific, just a vague "it's best practice".

Windows Updates are frequently cited as a reason to reboot servers, and thanks to the specifics of how Windows implements file locking, yes, the reboot there is unavoidable. This leads to the unfortunate tendency to accept reboots as a normal part of Windows Server operation, and engineers tend to see the reboot as the point (with an update thrown in since the server is rebooting anyway) instead of an unfortunate side-effect. I realise the need to keep servers patched, but again, when pressed for a description of which known defects (that have affected, or could plausibly affect, service) a particular update application - with associated downtime - will fix, the response comes in: "Um, best practise?".

In the absence of an actual security threat, known defect fix or imminent power failure, I am rarely convinced to shut a server down. I first included "vendor recommendation" in that list, but realised I've yet to see one. Ever.

Even at three in the morning when no sane customer could be relying on a system, during a once-quarterly change window when all services are nominated unavailable so service providers can make radical changes, even then: No, you can't reboot my server.

If engineers took the time to think about where in the continuum from empty server to complex beast the point of fear arrives, they could figure out which bits are scaring them and make sure those are well-understood, properly configured and maintained. Unfortunately, that takes time, effort and sometimes a bit of theory and modelling. Rebooting is so much less effort.

Windows Server can, and should, be expected to remain ready for service for as long as the hardware can last. With the advent of virtualisation and VMotion, even that obstacle is gone, and the limits are practically nowhere to be found. Applications are another story, and if the developer/support specialist thinks they need restarting, that's fine, but they have zero authority to suggest this for Windows.

I've heard the phrase "excessive uptime" identified as the root cause of outages. I doubt Microsoft would like to know the engineers they certify are - and I don't say this lightly - genuinely afraid of their product doing its job, as designed, for years. If that doesn't happen and a genuine OS fault occurs that only a reboot can solve, it is quite shocking how few engineers will actually report this problem to the vendor, and how many instead tolerate workarounds, design hacks and kludgy scripts.

In the same way that one can learn a procedure for changing the spark plugs on a specific model of engine, while completely missing the black smoke ejected from the exhaust thanks to a chronic ignition timing failure, so too engineers who have not yet attained a mediocre grasp of computing theory can continue to diagnose and treat only symptoms.

A server failing to continue to do its core function of staying up is not a mild symptom that a reboot can fix. It is a fundamental failure of the product, and failing to do the hard thing of actually understanding why and demanding the vendor improves their product does nobody any service.

In fact, it's negligent.

Previous: Part 4. Windows Updates and File Locking

Thursday, 10 October 2013

WYCRMS Part 4. Windows Updates and File Locking

In 1997, an HP 9000 engineer wouldn't blink telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

4. Windows Updates

It's Lab Time!

Open a console on Windows (7, 2008, whatever), and enter the following as a single line (unfortunately wrapped here, but it is a single line):

for /L %a in (1,1,30) do @(echo %a & waitfor /t 1 signal 2>NUL) >numbers

What's that all about you ask? Well, it sets up a loop, a counter (%a) that increases from 1 to 30. On each iteration, the value of %a is echoed to the standard output, and WAITFOR (normally a Resource Kit file, included in Windows 7/2008 R2) pauses for one second waiting for a particular signal. 2>NUL just means I'm not interested that the signal never arrives and want to throw away the error message by sending it to the NUL device (there's no simple sleep command I can find in Windows available to the console). Finally >numbers sends the output to the file numbers.

As soon as you press enter, the value 1 is written to the first line of a file called numbers. One second later, that line is overwritten with the value 2, and so on for thirty seconds.

Now open another console in the same directory while this first is running (you've got thirty seconds, go!) and type (note, type is a command that is typed in, it's not a command for you to start typing):

type numbers & del numbers

If the first command (the for loop) is still running, you'll get the contents of the file (a number), followed by an error when the delete command is attempted - this makes sense as the first loop is still busy writing to the file.

This demonstrates a very basic feature of Windows called File Locking. Put simply, it's a traffic cop for files and directories. The first loop opens a file and writes the first value, then the second, then the third, all while holding a lock on the file. This is a message to the kernel that nobody else is allowed to alter the file (deletes, moves, modifications) until the lock is released, which happens when the first program terminates or explicitly releases the file.

This is great for applications (think editing a Word document that someone else wants to modify), but when it comes time to apply updates or patches to the operating system or applications it can make things very complex. As an example, I have come across a bug in Windows TCP connection tracking that is fixed by a newer version of tcpip.sys, the file that provides Windows with an IP stack. Unfortunately, if Windows is running, tcpip.sys is in use (even if you disable every NIC), so as long as this file is being used by the kernel (always) it can never be overwritten. The only time to do this is when the kernel is not running - but then how do you process a file operation (performed by the kernel) when the kernel is not available?

Windows has a separate state it can enter before starting up completely where it processes pending operations. Essentially, when the update package notices it needs to update a file that is in use it tells Windows that there is a newer version waiting, and Windows places this in a queue. When starting up, if there are any entries in this queue, the kernel executes them first. If these impact the kernel itself (e.g. a new ntfs.sys), the system performs another reboot to allow the new version to be used.

This is the only time a reboot is necessary for file updates. Very often administrators simply forget to do simple things like shut down IIS, SQL or any number of other services when applying a patch for those components. A SQL Server hotfix is unlikely to contain fixes for kernel components, so simply shutting down all SQL instances before running the update will remove the reboot requirement entirely.
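The logic of the last few paragraphs can be modelled in a handful of lines. To be clear, this is a simulation with invented names (Machine, apply_update, the file names and version strings), not how Windows is implemented and not a real API: a file that is in use cannot be replaced, so the new version is queued and only swapped in at the next start; a file that nobody holds open is replaced on the spot.

```python
class Machine:
    """Toy model of in-use files and replacements deferred until reboot."""

    def __init__(self, files):
        self.files = dict(files)  # file name -> installed version
        self.in_use = set()       # files currently held open by something
        self.pending = []         # queued (name, version) replacements

    def apply_update(self, name, version):
        """Install a new version; return True if a reboot is now required."""
        if name in self.in_use:
            self.pending.append((name, version))  # cannot overwrite: queue it
            return True
        self.files[name] = version                # nothing holds it: replace now
        return False

    def reboot(self):
        """Process the queue before anything has the files open again."""
        self.in_use.clear()  # simplification: nothing is running yet
        for name, version in self.pending:
            self.files[name] = version
        self.pending.clear()


if __name__ == "__main__":
    server = Machine({"tcpip.sys": "1.0", "sqlservr.exe": "10.0"})
    server.in_use = {"tcpip.sys", "sqlservr.exe"}

    # Patching a running service: the file is locked, so a reboot is queued.
    print(server.apply_update("sqlservr.exe", "10.1"))  # True

    # Stop the service first and the same kind of patch needs no reboot.
    server.in_use.discard("sqlservr.exe")
    print(server.apply_update("sqlservr.exe", "10.2"))  # False
```

The kernel's own files never leave the in-use set while the system is up, which is why those, and only those, genuinely need the restart.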

Similarly, Internet Explorer is very often left running after completing the download of updates, some of which may apply to Internet Explorer itself. Even though this is not a system component, the file is in use and it is scheduled for action at reboot. Logging in with a stripped-down, administrative privileged account to execute updates removes the possibility that taskbar icons, IE, an explorer right-click extension or anything else is running that may impede a smooth, rebootless patch deployment that updates interactive user components.

This is simply a function of the way Windows handles file locking, and a bit of planning to ensure no conflicts arise can remove unnecessary reboots in a lot of cases.

Previous: Part 3. Console Applications, Java, Batch Files and Other Red Herrings

Wednesday, 9 October 2013

WYCRMS Part 3. Console Applications, Java, Batch Files and Other Red Herrings

In 1997, an HP 9000 engineer wouldn't blink telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

3. Console Applications, Java, Batch Files and Other Red Herrings

Not to start this off on a downer, but I need to let you know I'll be insulting a few types of people in this post. I also need to make one thing extra-clear: I hate Java.

The idea is great: write code once and run it on any platform without recompilation. Apart from the hideously long time it took for Java to come to 64-bit Linux, support is pretty good too. Sun (then Oracle) have been responsive in fixing bugs, but being a platform it is somewhat more difficult to roll out updates, so huge amounts of obsolete JRE deployments are available for nefarious types and buggy software to run amuck. The reason I hate it is twofold: it allows developers to be lazy about memory management and rely on automatic garbage collection, and almost every application I've come across (except Apache Tomcat-based WARs) explicitly constrains support to certain platforms. This is not what I was promised.

When someone talks about "Windows running out of handles", "memory leaks", "stale buffers" or any number of technical-sounding pseudo-buzzphrases they are almost always trying to describe a software malfunction that appears as a Windows failure, or are simply too lazy to investigate and realise it is almost invariably caused by lazy programming. Java does this, but I don't blame Java, I blame Java programmers. The opinion is rife that Windows gets less stable the longer Java applications run and that reboots are a Good Thing™. If someone genuinely believes that server stability can be impacted by poor software, but does not report it to the vendor, I will inform that person he/she is lazy.

As I mentioned in Part 1, Windows engineers seem to scale their experience of Windows at home to their professional roles, and I've seen developers do the same. Windows doesn't do pipes very well, or they are language- or IDE-specific. Outputting to the Event Log is slightly arcane and in fact requires compilation of a DLL to make most output meaningful. It's rarely used outside Microsoft themselves. So developers rely on consoles for display of meaningful output.

These consoles then become part of the deployment practice, perhaps wrapped in a batch file. If your program relies on a console window (and therefore a logged-in, interactive user session) or worse requires me to edit a batch file to apply configuration changes (as opposed to, say, a settings file parsed by said batch file), your software is nowhere near mature enough to be deployed on a system I would expect people to depend on. As a programmer, I question your maturity too.

It's people and organisations like that who typically have one response to issues that crop up: install the latest Service Pack, maybe the latest Windows Updates too (that fixes everything, right?), and if all else fails upgrade to the Latest Version of our software - don't worry that it's got a slew of new features that likely have bugs too, they'll be fixed in the next Latest Version. Rinse, repeat.

As a Windows Engineer, your job is to defend the platform from all attackers. That's not just bad folks out there trying to steal credit card numbers and use you as a spam bot, it's also bad-faith actors trying to deflect the blame from their own inadequacy. It's application owners prepared to throw your platform under the bus to hide their poor procurement and evaluation standards. It's users who saw a benefit in a reboot once and think it's a panacea.

It is in everyone's interest to call people out when they fail to deal with this stuff properly, or you'll quickly find yourself supporting a collection of workarounds instead of a server platform.

Previous: Part 2. Windows Just Isn't That Stable

Tuesday, 8 October 2013

WYCRMS Part 2. Windows Just Isn't That Stable

In 1997, an HP 9000 engineer wouldn't blink telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

2. Windows Just Isn't That Stable

Ah BlueScreen of Death, how I've missed you. Actually, I haven't, since finding out what caused them was a nightmare, and recovering without a remote console solution is not conducive to a predictable social life (or sleep schedule). That said, they were so common we even had joke screen savers mimicking them for our own geekish amusement. Since Microsoft acquired Sysinternals they're even available to download directly from Microsoft. Imagine your in-car entertainment system being configured to show you fake warnings of a failed brake line, or a cracked cylinder head. "Would you like the free video package of Ford vehicles endangering passengers' lives with your new Focus sir?". IT people are weird.

I've analysed my Windows 7 x64 installation, and in the last three years I've had six bluescreens. One was my graphics card (pretty unique), all the others were my Bluetooth headphones putting my cheapo-Bluetooth dongle in a spin. I blame the dongle, not Windows.

OK, that's not fair to the dongle maker: I blame Windows, but only the Bluetooth stack since it's never been something I expect Windows to do well - multiple dongle-headphone combinations have yet to produce a pleasant experience (three dongles, two headphone models). The network card, storage stack, print drivers, memory management, process scheduler (NUMA-aware these days apparently): These all work so well I haven't noticed them doing their job, and I am very familiar with what a complex job they have.

I expect roughly once a month to see a BSoD on public transport, or at stations, or many airports, or billboards. The layout of the BSoD has changed over the years, with each version of Windows getting a little tweak so that you can spot the version even if the error itself is gibberish, and I conclude from viewing these blue non-advertisements: These systems tend to be A) old, B) written in languages and coding styles that aren't that good, and C) interface with devices with terrible drivers.

This is not typical of modern Windows servers.

I would never dream of subjecting a server to the amount of change my hard-working personal workstation endures. AMD updates my video drivers multiple times a year, I attach and detach USB/phone/iSCSI devices more often than I refill my car's tank, and run code from pretty much anywhere as long as it promises me utility or entertainment. A server is different, running things I trust to go on processing without attendance, cleaning up after itself, and basically staying up. If I do make changes, it's controlled, tested and left the hell alone.

Windows Server is solid, and every iteration gets more solid. It's expanding to 64-bit spaces, handling multipath-iSCSI with ease, more cores than I have fingers in byzantine NUMA layouts, hosting server instances in their own right with Hyper-V and pushing gigabytes around through network cards and storage interfaces, crunching data and most importantly providing services.

Yet the very people who spend time and money proving they are skilled in designing and administering these systems so that they can adorn their signatures and office receptions with impressive Microsoft-approved decals are the first to tell you not to trust a given server (without even knowing the workload or configuration) to remain available. They express surprise and concern on viewing a server continuously running for over a year.

I'm surprised and, yes, concerned that they react this way. Isn't this what your sales folks promised me in the first place?

Previous: Part 1: But I Have to Reboot My Own Windows System All the Time!

Monday, 7 October 2013

WYCRMS Part 1: But I Have to Reboot My Own Windows System All the Time!

In 1997, a HP 9000 engineer wouldn't blink telling about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

1. But I Have to Reboot My Own Windows System all The Time!

I've mentioned before how Windows makes you lazy. One of the great things about Microsoft Windows as a platform is that software developed on a $500 workstation can be installed on a $50,000 server and probably work without problems. Of course, getting your home-brew software to scale is a different matter, but you get the idea: One platform, different size.

Almost every Windows engineer cuts their teeth on Windows at home, and this informs their experience and expectations of the platform. Like everyone I get tired of the bogging down after a few days/weeks/months uptime and reboot just to clear things up, but that's my fault and not Windows.

I'm lazy.

Typically, I'm running browsers, office suites, anti-virus, any number of games, and install new stuff roughly once a fortnight. Flash, Java and Windows Update are constantly pestering me to reboot after updates. I've even been the one to reinstall completely after a year to see the wonder of a zippy start-up and responsive GUI, only to have it slowly crawl as I add functionality (including those games). Happily, my Windows 7 installation has lasted two years by now with no significant falloff in responsiveness, so that's getting much better, and I only power down/reboot of my own volition when I'm fitting lights and need mains power off - even then it's more likely to be a hibernate.

Servers are not workstations. Any good enterprise has controls for how changes are made to IT systems, and even simple patching requires testing and approved windows to take the system down and update it. In my experience a server will undergo a major overhaul at most twice in its operational lifetime, and organisations with exceptional controls have zero - new version? New server!

A good server (and I think of Windows Server 2003+ as good servers) will run for decades given quality power and no moving parts. Of course hardware fails, but Microsoft have put in man-decades to get Windows to handle routine changes without downtime. I remember Windows NT 4.0 needing a reboot for an additional IP address. Modern versions of Windows can hot-plug an entire NIC (physically) without a blink, though admittedly I've never actually encountered anyone who uses the facility.

If an engineer merely mentions that, in their experience, Windows needs rebooting I question their experience. I mean it: I question their experience!

Windows is solid, and I can recall only one confirmed bug where Windows will fail (actually, begin to fail, an outage is not a certainty) for the simple factor of running continuously for a given time. When someone speaks of a memory leak that has caused Windows to run out of (insert woolly term here), again I question their experience and the quality of the software/vendor driver code. I've stopped blaming Microsoft.

When I run my applications on Windows Server and, more importantly, when I am paying someone to manage those systems for me, I expect them to have faith in their products and promise me server availability. Rebooting breaks availability.

Previous: Why You Can't Reboot my Server

Why You Can't Reboot My Server

When I was an on-site server engineer in 1997, I stood next to a HP 9000 engineer waiting for a SCSI hard drive at our parts depot, and we got chatting about his next work order: He was off to install a tape drive. I asked him what the new hard drive had to do with it, and he mentioned that the server in question had been running continuously for over seven years, and at least one drive was likely to get stuck and refuse to spin again once he turned the frame back on.

I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

In this series of posts, I'll be looking at the most common reasons Windows engineers and administrators feel are adequate to justify rebooting servers, either as (or instead of) a diagnostic step, on a schedule that can best be described as arbitrary, or even artificially to apply fixes for problems the system doesn't have.

In this series:
Part 1: But I Have to Reboot My Own Windows System All the Time!
Part 2. Windows Just Isn't That Stable
Part 3. Console Applications, Java, Batch Files and Other Red Herrings 
Part 4. Windows Updates and File Locking 
Part 5. Nobody Ever Runs a Server That Long
Part 6. It's OK, the Resilient Partner Can Take Over
Part 7. I Don't Think You Understand What A Server Is

How to Make Your Customers Feel Like Meat in a Tube

Few things annoy me more than web-based forms for initiating customer contact. My experience of them ranges from poor to dismal, and even when I point out that I expect companies to fail in their response I am rarely surprised by brilliance (or even adequacy).

The first problem with these forms is actually the result: Your enquiry ends up not as an e-mail for a person, but a record in a database. Some forms are worse than others in betraying this, but if you even have to select your company size or decision-making company role you can be sure you're being slotted into a Customer Spamming Service machine.

From there, around four out of five responses make no reference to your original query. Unlike e-mail, where you can save your initial contact in your Sent folder, and typically hitting Reply generates a new mail on top of your original one, the first response you receive almost always has no history, so you're left scratching your head wondering if you really forgot to mention your product's model number, even when you remember having to look up the unicode for the unnecessarily accented é in the product name. Whether a human typed out your reply, selected a form response or some machine logic matched your keywords to information already available in the FAQ, I will offer odds, without knowing who the company is, that the original question is not included for reference.

It's a pain to fill in forms like these repeatedly for each individual question, so you might be tempted to put more than one question in your query. Beware traveller - the company will choose which answer most closely matches their prepared form responses and send that to you, regardless of the amount of prominence you try to give to the one you really need answered first.

Errors on the form? How about not residing in the US so skipping the "state" field, only to be told the field is mandatory. OK, I live in Wyoming, Netherlands. Ah, the form now tells me having a state filled in outside the US is an invalid choice? Check the dropdown - yep, only US states available and no way to not pick one. Don't bother complaining about the logic in your actual request - you see, the people choosing stock responses to send that don't adequately deal with your query, they're in no way connected with the end of the sausage maker that ruins your customer contact experience from the start. They just turn the handle.

While sending an enquiry to a prominent software vendor, I happened to have NoScript turned on and found the form broken beyond use. This is simply not justifiable. Oh well, I'll enable the site for JS, but lo! The form fails to complete again. This time it is because a piece of code from a marketing firm has not arrived. So prominent, it even has the name market in its name - answering my question vs completing my digital profile for a third party: Which do you think they care about most?

All this from an IT Security company, that sells products to control mobile phone policies to stop users from doing things like installing untrusted software that sends their data to unknown parties, without telling you.

Why am I running a marketing company's JavaScript to collect my personal information to initiate an evaluation of your products?

I am just meat in a sausage to you people, aren't I?

Wednesday, 18 September 2013

Streetview and WiFi - Courts Need Some Education

I'm hanging my head in my palm in a manner not unlike the Jean-Luc Picard meme today, after reading a decision by a court in the United States. EFF has a great article summarising the effects, but I'd like to expand and go into the cause too.

So Google drove around with an antenna on their Street View cars for a few years, sniffing for wireless networks. This is very useful if you'd like to know your location but can't get a GPS signal, especially on devices with lower quality antennas or a location with poor sky visibility. As long as you have a data connection, you send the local Access Points off to Google's servers, and they will look up the location and feed it back to you. Simple, right?
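The lookup described above can be sketched roughly as follows. This is only an illustration of the idea, not Google's actual method: the `AP_LOCATIONS` table, the `estimate_position` function and the weighted-average approach are all assumptions of mine, and the MAC addresses and coordinates are made up.

```python
from typing import Dict, Optional, Tuple

# Hypothetical survey database: AP MAC address (BSSID) -> (latitude, longitude).
# In the scenario above this would be built from the drive-by survey data.
AP_LOCATIONS: Dict[str, Tuple[float, float]] = {
    "00:11:22:33:44:55": (52.0, 4.0),
    "66:77:88:99:aa:bb": (52.0, 4.002),
}


def estimate_position(observed: Dict[str, float]) -> Optional[Tuple[float, float]]:
    """Estimate a position from the access points a device can currently see.

    `observed` maps BSSID -> a positive weight (for example a relative
    signal strength). Unknown access points are ignored. Returns None when
    no observed access point is in the database.
    """
    total = lat = lon = 0.0
    for bssid, weight in observed.items():
        known = AP_LOCATIONS.get(bssid.lower())
        if known is None or weight <= 0:
            continue
        lat += known[0] * weight
        lon += known[1] * weight
        total += weight
    if total == 0:
        return None
    # Weighted centroid of the known access points.
    return lat / total, lon / total
```

The point of the sketch is that the device only needs to report which APs it can see; nothing in the payload of anyone's traffic is required for the lookup itself.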

Well, sniffing for access points (AP) is ridiculously simple. Your phone does that every time you look at a list of networks in your location, by looking for a special network frame called a beacon, which a regular access point sends out roughly ten times a second. It's crucial to how WiFi works. Beacons share the channel and the same basic header layout as normal traffic, so even without a beacon you can still see at least the presence of traffic, and the MAC address associated with the AP. If the network is unencrypted, your WiFi network card automatically accepts those traffic frames too, then has them discarded by your kernel because they are not destined for your computer/smartphone/tablet etc ("your device").
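To show how little "decoding" is involved, here is a minimal sketch that tells a beacon from ordinary traffic using nothing but the first byte of an 802.11 frame. It assumes the bytes start at the 802.11 MAC header (any capture header such as radiotap already stripped); the function name `classify_frame` is my own, not part of any library.

```python
from typing import Tuple

# The 2-bit "type" field of the 802.11 frame control field.
FRAME_TYPES = {0: "management", 1: "control", 2: "data", 3: "reserved/extension"}
BEACON_SUBTYPE = 8  # management subtype used by beacon frames


def classify_frame(frame: bytes) -> Tuple[str, int, bool]:
    """Return (type name, subtype, is_beacon) for a raw 802.11 frame.

    Only the first frame-control byte is inspected:
    bits 0-1 protocol version, bits 2-3 type, bits 4-7 subtype.
    """
    if not frame:
        raise ValueError("empty frame")
    fc = frame[0]
    frame_type = (fc >> 2) & 0b11
    subtype = (fc >> 4) & 0b1111
    is_beacon = frame_type == 0 and subtype == BEACON_SUBTYPE
    return FRAME_TYPES[frame_type], subtype, is_beacon
```

A beacon therefore starts with the byte `0x80` and a plain data frame with `0x08`; the receiving hardware has to look at this byte for every frame it hears, whoever the frame was meant for.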

I've done sniffing myself, in a practice known as "war driving". It sounds ominous, but it's also a very interesting exercise, for which I purchased a specific Atheros WiFi NIC, thanks to their products having excellent Linux support. Hook up a GPS receiver, and just go places. The software figures out where you are, looks at the list of APs nearby and pins them to a map. The problem here is that simply enabling your card to listen for APs does cause your system to store those traffic frames, since they are useful for determining the IP range in use on that network. Note I'm not trying to use those networks, I'm just interested in seeing how they're being used.

Now comes the interesting part: The court decided these signals being intercepted were not "radio communications" (despite being carried by photons) for the purposes of a legal interpretation, not "readily accessible to the general public" without "sophisticated hardware and software", and finally "most of the general public lacks the expertise to intercept and decode payload data transmitted over a Wi-Fi network".

On each point:
  • It's radio. Learn physics. The lower court's opinion that the law covered "predominantly auditory broadcast" is an inference from the court; 18 USC § 2511 mentions audio only for satellite transmission, and only then to describe an audio channel used as a carrier for digital communications therein. If I speak ones and zeros into my walkie-talkie, does this suddenly become digital communications? Your meaning is divorced from reality, and I think it suits an agenda instead of fact.
  • If I walk down a street, glancing into shops as I go, and see a person in one shop/office/etc handing a big photo with a red X through it to another person, with the person in the photo turning up dead and either of the two persons implicated at trial, I can't be prohibited from testifying because I got information not "readily accessible to the general public" (i.e. they were not standing on a street in plain view); a police officer seeing the same thing under the same circumstances is not prohibited from using the information for lack of a warrant or probable cause - it still doesn't affect the fact that photons bounced off the photo and got interpreted by "specialised hardware" (eyes) and "specialised software" (brain). Nobody's privacy is invaded, but information was blasted into the street nonetheless.
  • WiFi NIC cost: $10 (shipping costs vary), down to as low as one. These devices are sophisticated, don't get me wrong, but then most computers are mind-numbingly powerful today. Linux kernel cost: nothing (download costs may vary, but are unlikely to make you hit your ISP's download cap), easily installed in an array of distributions. Laptop cost: Variable, but if you've got one lying around you already have your solution.
  • Interception is easy stuff. Decoding is easy stuff, and your device "decodes" it as part of its primary function.
  • A further point the court asserted is that regular (AM/FM) radio communications can be received miles away, versus WiFi that "fail to travel far beyond the walls" of a location. Again, physics. Oh, and define "far" - my balcony is less than 15 metres from my AP and my signal comes and goes, while the street is 30 metres away (in the opposite direction) and I still get an association (but not much throughput) fairly reliably. I despise vagueness in court proceedings.
The fact is, sophisticated software is required to not receive the payload. If somebody decides to configure and use an unencrypted access point and I happen to walk past with my phone doing the searching (I left the WiFi on when I left my home), or researching which models of router or service provider are prevalent on the street, or finally if the NSA, FBI, federal or local officials are parked in a van across the road, simply turning on the function puts the traffic in RAM. Even if it is destroyed a microsecond later, under this ruling I (and they) have broken the law. The bar for warrant-required searches just shot up.

I'm used to seeing courts being out of touch with reality, especially in computing cases, but this is beyond unreasonable. I have no doubt Google had zero intention of capturing user detail such as e-mails, usernames or passwords (why would they, they run an e-mail service?), and are now being prosecuted for users' inability to secure their own networks.

But mostly: Physics!

Thursday, 24 January 2013

Dawson College: What Island Are You On?

I've been viewing the growing story about Ahmed Al-Khabaz, a Computer Science student at Dawson College in Montreal, Canada, who was expelled for running a security scan against their public web presence to discover if a flaw he found was resolved. I stumbled on a subsequent interview with prominent IT Security professional Chris Wysopal. I hadn't heard of him, but when I saw he was previously associated with L0pht Heavy Industries my eyes snapped open.

This guy has credentials, and I don't know of an IT professional active around 1997 to 2003 who hadn't heard of, or actually used, L0phtCrack, often to solve real-world problems. First and foremost a password auditing tool, it can be used maliciously, but then so can a toaster oven. It is a piece of code art: Necessary, useful and (at the time) industry-shaking.

White Hat hacking is a tricky business. Even I've done it, against a bank no less, fully in the knowledge that I was doing something the system owners would be very unhappy about. In some cases it can get you arrested. I was pleased with the results when my concerns were taken seriously and fixed fairly quickly. I've worked in financial services companies and know their software release process is iceberg slow so this was very reassuring. There's one thing Mr Al-Khabaz and I both know that drives thousands around the world to the same end: I'm at risk.

Dawson College is hand-wringing and special pleading: "the law ... forbids us from discussing your personal student files" is in this case weak. I am pretty sure the former student would agree to a waiver of his right to privacy to clear the air, but I have seen no mention of an offer. Fourteen out of the fifteen professors convened voted for his expulsion, for doing what some professionals get paid extremely well to do (even I've been offered this job): Evaluate the security of publicly-accessible websites. I would like someone better informed than me to comment on what the implications would be for the institution if it was discovered a breach because of this flaw caused losses thanks to the personal information disclosed.

I can appreciate that the college does in fact have to abide by law, and is unwilling to get into a mudslinging match in the public forum. They have rules for ethical behaviour that may have been violated (I haven't seen them). But beyond those considerations, every one of the fourteen professors needs to answer one simple question:

Why, if these actions are so outrageous of a Computer Science graduate that they demonstrate "behavior that is unacceptable in a computing professional", has the company whose software flaws he exposed taken it upon themselves to pay for his further education?

Academia is often seen as disconnected from reality; some lines of research beggar belief, and the same could be said of Computer Science. I've met a few graduates who arrive in the IT industry ill-prepared, full of theory of operation and design but unable to command a command-line. No matter what their actual instruction is, a critical point they need to learn is that the Internet is a hostile place. It is also a collaborative place, where FOSS abounds and Creative Commons is richly rewarding. Poking around is the norm, and if this college is telling their students that they are to accept their instruction blindly without considering real-world implications, or use those skills to explore, then they don't deserve to be associated with the term Higher Education.

They may perhaps be able to educate Code Monkeys, but thinking professionals able to design and protect systems that impact their lives? Not really.

Wednesday, 23 January 2013

How Important is MariaDB? Let's test the fork with butter.

MySQL has interested me for quite a long time. I first came across it in 2000 when trying to find a better way to analyse the contents of a 20,000 user Active Directory and needed more relational DB-stuff than Microsoft Access could deliver and cheaper than SQL Server (wow MSDE was terrible). I was deeply impressed (though probably because I was easily impressed back then) with the performance and cross-platform support, and ever since it's been around my life.

I currently use it for my XBMC and Logitech Media Server (SqueezeBox) media databases, as the back-end for my Gallery3 site, and other ad-hoc databases whenever I need to crunch data. Before my 64-bit processor created a new ISA that ensured a reasonably complete instruction set, it was a favourite of mine for optimising binary compiles over the stock i386 build supplied by most distros, but more for interest's sake than actually squeezing performance for any measurable benefit.

MySQL AB was of course the owner of the copyrights and code and opted for a relatively unique license, both proprietary and open. As the owners of the code, they could choose to do this, but anyone trying to make a buck out of the code was obliged to release their modifications. Now that Oracle (through their acquisition of Sun, who acquired MySQL AB) have that right, the open-source community is in a bit of a fluster. Can we trust a corporate giant with custody of the code that runs a significant fraction of the Internet's websites? The answer is slowly coming down on the side of "no".

Oracle (and others, and unsurprisingly) is being guarded about bugs and fixes. Stories of vendors forcing customers into NDAs before even admitting bugs exist, hiding bugs from other customers, and silently including fixes are common. It's face-saving. Andy Grove's "Only the Paranoid Survive" starts off with how Intel hoped to keep their Pentium FPU bug quiet while they implemented a workaround, which simply smacks of arrogance. While it doesn't yet seem Oracle are trying to hide any actual code and still supply source, MySQL has historically had test cases for bugs published alongside them to protect against regression and anyone can run the suite on their installation to verify code quality. Not only are they apparently now keeping some cases secret, they are also not clearly marking which code updates fix bugs they are refusing to publish.

This is not how open-source works, but I don't agree with the prevailing rationale. Red Hat came into the firing line for being less than open about how they handled a code signing infrastructure breach, but in that instance I support the way they behaved as it was not their source they concealed, rather their own systems and controls that were embarrassingly compromised. They have shareholders, and revealing too much would have cost them. Oracle too have value invested in their products and would like to keep flaws hidden. This is not nefarious, it's capitalism.

MySQL as a product is different, no matter who owns it. It is very closely tied to the spirit of the open-source movement, being both highly regarded for performance and features, and for the competition it gives proprietary offerings. For Oracle to claim that ground back is entirely within their right, but the edge is gone. The most ardent supporters and influencers of purchasing are not happy and a slow exodus may be starting.

So Fedora and Wikipedia are both contemplating pulling out. The MariaDB fork has all the features and more, is fully open in the original spirit of the project, and is attracting attention including mine. I have no idea how easy it will be to do the fabled "drop-in replacement" every source claims is possible but I feel ethically compelled to leave MySQL in the dust. I have a server that runs my digital life and it is a conscious choice to run on open software only and it has not been easy, but as an experiment and learning tool it is invaluable.

The great thing about open-source is anybody can fork. I can clone a source and apply my changes as I like, but the moment I try to give it to anyone else (especially selling the result) I have to disclose my whole body of work. This can lead to some confusion as the early days of Linux showed, but in the end the market weeds out the under-performers and delivers better products through sheer market forces. MariaDB seems to be that winner.

I do know one thing: testing the transition is going to be a breeze. After switching from Fedora to Gentoo four months ago, I rolled the root over to BTRFS (once kernel 3.6 gave me the necessary confidence). Add a distinct IP to the NIC, snapshot, chroot, and I've got a clone of my server ready to go in about two seconds without that system-level virtualisation stuff and hideously slow LVM2 snapshots.

Rollback to base for a fresh attempt? Yep, two seconds.