Thursday 26 January 2017

Don't Just Tell Me "It's Encrypted"

"It's Encrypted"
These two words are a trigger. They seem innocuous enough, but the devil is in the details, and the details of cryptography are notoriously byzantine. It's tempting to allow ourselves to be soothed by these words, but they fall well short of an assurance.
In 2002, Bill Gates sent a memo inside Microsoft introducing his vision for Trustworthy Computing (not to be confused with Trusted Computing, which is a distinct effort). In addition to Security and Privacy, the Availability goal in that memo has since been likened to a light switch. You don't think about whether the power will arrive when you flip that switch; you simply trust that it will. It's hard to see the forest for the trees sometimes, but when you consider the immense complexity and level of engineering in the products we depend on fifteen years later (tablets, laptops, mobile phones and streaming music systems, to name a few), it is a remarkable achievement.
The light switch analogy is instructive. Power generation and distribution is a gargantuan industry, and it consumes spectacular amounts of money to provide you with the simplicity and affordability of that light switch. Similarly, achieving the level of stability and performance in our computing devices has consumed millions of productivity hours and billions of dollars. Privacy, the often forgotten cousin of Security and Availability in the original memo, is expected to varying degrees by different audiences, but it is nonetheless something our daily lives depend on.

"It's Encrypted" sounds like the light switch - problem solved, it's encrypted, I can rest. It is a rare day indeed where that claim is picked apart that I do not find (sometimes significant) issues with the claim. If not done correctly, badly implemented security (but encryption especially) can be worse than no security.
Cryptography is an esoteric field. As I write this, Mozilla is due to start warning users of websites that continue to use certificates signed with the SHA-1 algorithm. This is likely to be Greek to the average user, but to security professionals this is a reasonable step in the march forward, and if naming and shaming is required to prod those last holdouts into the future, then at least users are informed that these providers are not up to snuff. This stuff is ubiquitous, and few users realise that their $150 smartphone is loaded with some of the most advanced cryptography available to the average Joe. They depend on it often more than they know, so interventions can be useful.
The problems are not in the ciphers; handshakes and protocols are continually improved to counter new threats while enabling new functionality. The biggest headache (and the most frequent downfall of "It's Encrypted" claims) is key management. As Bruce Schneier reminds us in Applied Cryptography: "In the real world, key management is the hardest part of cryptography."
Private keys in a PKI are often treated with trivial regard. The most recognisable to administrators is the PFX, or .P12, file. Incorrectly referred to as a "certificate file" (a wildly inappropriate shorthand), these few kilobytes are handled as arbitrary packages a system needs in order to function, rather than as critical pins in the security apparatus. The PKCS#12 format was designed for key archival, yet these files are routinely sent via e-mail and stored on easily accessible drives to ease administration and troubleshooting.
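To make the point concrete, here is a minimal sketch (Python, using the third-party cryptography package; the filename and password are hypothetical) of how little stands between a wandering .P12 file and the private key inside it:

# A sketch of why a PFX is not "just a certificate": given the file and its
# (often widely shared) password, the private key falls straight out.
from cryptography.hazmat.primitives.serialization import (
    Encoding, PrivateFormat, NoEncryption, pkcs12,
)

with open("webserver.pfx", "rb") as f:
    key, cert, extra_certs = pkcs12.load_key_and_certificates(
        f.read(), b"Password1")

# The private half, now in cleartext PEM - everything needed to impersonate
# the server it belongs to.
print(key.private_bytes(Encoding.PEM, PrivateFormat.PKCS8,
                        NoEncryption()).decode())
print(cert.subject)

Anyone holding that file and its routinely shared password holds the server's identity, which is why it deserves rather more care than a stray attachment.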
On the other hand, pre-shared and private keys are frequently recorded when there is no need to keep any record at all: if a system has issues, generate a new key pair and discard the old one - only a storage corruption would make an archived copy useful, and in that case you have bigger problems. TLS 1.2 embodies this principle with its ephemeral keys, but IPsec VPN administrators have yet to learn the lesson: keys are still recorded as if they will be needed again some day, undermining the confidentiality and integrity of the link.
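For contrast, here is a rough sketch of the ephemeral idea in the spirit of TLS's ECDHE exchange (Python again, with the cryptography package; this is illustrative, not a faithful TLS handshake): generate, derive, and let the private halves evaporate.

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

client_priv = ec.generate_private_key(ec.SECP256R1())   # never written down
server_priv = ec.generate_private_key(ec.SECP256R1())   # never written down

# Only the public halves ever cross the wire.
shared_c = client_priv.exchange(ec.ECDH(), server_priv.public_key())
shared_s = server_priv.exchange(ec.ECDH(), client_priv.public_key())
assert shared_c == shared_s

session_key = HKDF(algorithm=hashes.SHA256(), length=32,
                   salt=None, info=b"handshake").derive(shared_c)

# The private keys now simply fall out of scope. There is nothing to archive,
# so no recorded copy can ever come back to undermine the session.

That is all an ephemeral exchange asks of you: if the key is never recorded, there is nothing to leak.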
I worked on a proposal for a customer who demanded full encryption of all storage volumes at a cloud hosting provider, and quickly determined that the theory and practice of these cryptosystems prohibited a robust solution. Of course, it could be fudged (and eventually was, over my objections), but it took me a while to realise that the actual integrity of the system was not the point. The point was merely to be able to make the claim "It's Encrypted".
The eventual implementation made use of "self-decryption". This is not a thing. By all means, systems rely on obfuscation to hide key material and defeat trivial intrusions, but if a system knows how to decrypt its own resources, an attacker will be able to achieve the same goal. The Trusted Platform Modules built into modern mobile devices, together with hardware-backed key storage, provide a slick answer to this problem, both attesting that the system is trustworthy and releasing the decryption keys only once that assurance is established. A virtual environment has no such opportunity.
Another issue in the eventual implementation was the use of a central key repository built on commercial off-the-shelf (COTS) software. Since the designers had no real training or experience in cryptography or security principles, they made the fatal mistake of trusting the very systems they were trying to protect. Every server is hosted at the provider, and on startup requests access to a database of keys over CIFS, with authentication handled by the computer's AD account. Since the single database is decrypted in its entirety by any participating server, every server has access to the keys of every other. This is disastrously insecure: the cloud provider has administrative access to these systems, and it is precisely the provider's view that the cryptosystem is supposed to defend against; ISVs and integrators are also frequently given administrative credentials to install and test their products, again allowing a nefarious actor among them to retrieve as many keys as they please.
But hey, "It's Encrypted".

Monday 13 October 2014

Posturing over Encryption Defeats Everyone's Security

Apple and Google announced in short succession that they will be turning on encryption by default for all of their devices, and the reaction from law enforcement (and some press elements who pander to them) is nothing short of incendiary. The nexus of national security, organised crime, think-of-the-children rhetoric and technology concepts barely understood by laypeople is fascinating to watch, but one thing stands out that makes this posturing baffling: law enforcement already knows the limitations of modern cryptography on mobile devices, and they're borderline lying when they decry this latest move.

Encryption on Android devices, at least, is very strong, relying on deep structures in the Linux kernel and associated tools to create and manage encrypted storage devices. Apple have improved things a bit in iOS 8, but they've a ways to go. The announcements change nothing about the capabilities of these devices that law enforcement don't already worry about, but that hasn't stopped them from launching a PR push in an attempt to show they're actually doing something - anything - to tackle the problems under their purview.

There is a serious flaw in this. The announcements merely describe the starting state of a new device: existing methods for securing it will now be turned on by default. Any savvy user who cares about their privacy (and unscrupulous ones probably more so) will have long ago figured out how to activate these features. It is the grand tradition of DRM all over again. There, normal users were left exposed or under-served, while underhanded consumers simply downloaded what they wanted, watched it on devices that pleased them, and avoided limitations like forced advertisements and sub-optimal formats. Likewise, regular users of mobile devices have been left woefully under-protected despite the facility being there to improve their security, while the security-conscious (and, I keep admitting, not always those with high-minded intentions) could already exploit these features for their own protection.

Law enforcement knows this. Their hand-waving (and predictably think-of-the-children) responses are at best facile and at worst outright lies. If they did not know these devices were already capable of encryption, then they are not competent; if they did know but claim this default setting changes the landscape for law enforcement, then they are certainly dishonest - someone choosing to hide data from law enforcement already has these tools. These decisions are security for the rest of us.

This is not new. I used to have respect for Dianne Feinstein, the present (as of October 2014) chair of the US Senate Select Committee on Intelligence. She once remarked that Edward Snowden should have come forward with his concerns privately, even directly to her, rather than disclosing his information to the public; that this would still have resulted in the reforms and protections now underway at the NSA, CIA and other intelligence bodies in the US. Again, she is either incompetent, in believing that a disclosure made in an environment of total secrecy, without the scandal the public disclosure caused, would have produced the same safeguards for individual rights (it would not), or she is posturing for effect, misleading the public with what she knows to be a false tale of her and her colleagues' desire to curtail intelligence gathering wherever it infringes those rights (they never would). Cue outrage when the CIA is found to be violating those rights - this time hers. She cannot be taken seriously.

Neither can any government official who says that the two largest smartphone OS vendors are hurting law enforcement because of a non-technical change. They are posturing, lying or showing incompetence (at best because of bad advice, but unlikely). They would like to intentionally leave us exposed to data theft, privacy violations based on the flimsiest of evidence, insufficient safeguards for that data and abuse by criminals who use government and law enforcement's own tools and methods.

I don't blame them for this as it's their job to pretend we're all at risk, that their job is hard, and that we should trust them. I know this and can't fault them (too much) for it.

My true outrage is with media outlets that don't just blindly parrot law enforcement's talking points but add their own ill-informed scaremongering flourishes. None of these articles mention that these protections are already available today, and in that omission they imply this is a new level of crime-friendly protection with no benefit for ordinary users. To the tech journalists who stand by their stories and insist government needs to act to cripple security or mandate backdoors: history has a lesson for you, as do real technology journalists.

Tuesday 5 November 2013

WYCRMS Part 7. I Don't Think You Understand What A Server Is

In 1997, an HP 9000 engineer wouldn't blink when telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

7. I Don't Think You Understand What A Server Is

It's taken me a long time to find the right words to explain this title, because it's a bit contentious. This is a longer article, but I hope it's of value, and I hope it expresses my sense of pride in and love of computing.

I've explained in previous posts that Windows Server is accessible to those new to IT engineering because it offers a simple learning path from the systems most of us have in our homes, schools, libraries and elsewhere. Moving to depending on a server to provide a service takes different people different amounts of time to get right, and some organisations are more tolerant of slip-ups than others.

There is something that starts to take over. It's not obvious, and since it applies to something commonplace and often the subject of passion for those true geeks, it is also often unrecognised in oneself.

Fear

In no other industry would a consumer or customer tolerate advice such as rebuilding a product, service or transaction because of a minor fault. No mechanic would expect to be paid if they told their customer that the brakes may fail after an hour on the freeway, advising the driver to just come to a complete stop and then move off again, yet this is what we in the (Windows) IT service industry resort to all too often. It's a mild paranoia that all things Microsoft (and others, as I will show) are prone to misbehave, and I've been told this explicitly, in writing, more than once and by multiple providers after I've requested a feature or function be deployed. Providing vendor documentation in favour of the product's capabilities doesn't sway these types, as they've experienced their own horror stories of staying up until 2am while functionality is restored. I know, I've been in that trench.

But imagine for a moment if you'd gone to a garage affiliated directly with your car's manufacturer. A Ford mechanic providing me the aforementioned brake-fault workaround would be held to account the moment he displayed his certification of training or affiliation with Ford Motor. I could challenge him and go to his management to insist that real advice and a real fix be provided for my product. If he could show me that his advice was sound, my beef would be with Ford for selling me an underperforming product. Either way, I'd have recourse.

IT engineers seem to think they don't, either in being the right place to receive this criticism (hey, they didn't write the software, they just run it), or in being able to back up their fears with proof that vendor products in fact don't behave as promised. Yet the offices of IT service providers proudly display partner certifications, while engineers with IT credentials flowing off their resumes continue to fear the very products that are their livelihood and the foundation of the logos they show off with such pride. Doublethink indeed.

I am very active in reporting bugs in open-source and even paid-for products, because I expect that a product is only as good as those who help make it better. I've already mentioned that failing to even report faults to vendors is negligent, and that engineers should have more pride in their platform and confidently defend it from detractors with authoritative sources. What I haven't spoken about is the relationship those engineers have with the platform itself.

There is a widely held belief in ICT, now over a decade old, that is demonstrably false: that Ethernet autonegotiation is quirky and potentially dangerous. There was a time this was true, but it has not been for at least five years, since 100Base-TX lost its leading position as the datacenter, server and finally desktop connectivity method. Implementing Fast Ethernet (as 100Mb Ethernet was known) needs some knowledge of standards. When the Institute of Electrical and Electronics Engineers (IEEE) published their 802.3u standard for Fast Ethernet, I recall an interview with one of the panel members who stated that it is technically possible to run 100Mbit over Stacheldraht (German for barbed wire) and you may have success. He made it clear, though, that your experience cannot be guaranteed as it is not 802.3u-compliant.

That's the crux: when a vendor states something as a standard, part of a reference architecture, or included in their documentation, they're making a promise.

The section of the new standard dealing with how two nodes select the operating speed and duplex setting was, unfortunately, not precise enough and was open to interpretation. Cisco and a few other vendors chose one interpretation while everyone else chose another, and the resulting duplex mismatch is notoriously hard to diagnose, occurring as it does only under moderate load, while a ping test over an idle link will likely succeed. It's insidious, and it resulted in the near-universal abandonment of autonegotiation in implementations (especially datacenters and core networks).

The problem is, autonegotiation is not only working well in Gigabit Ethernet (over twisted-pair copper, i.e. Cat5e/Cat6 cabling), it is mandatory. Even network professionals, burnt in the 90s and later by Fast Ethernet, advise against turning on a feature that is explicitly required for a truly standards-compliant implementation, with all the promises attached. A prime reason is that the applicable line is buried deep inside clause 28 of the IEEE 802.3 standard, as amended for Gigabit. It's dry reading...

Gigabit Ethernet was a big jump forward that started to seriously tax memory buses and CPUs like no iteration before it, and it includes a highly valuable feature known as the Pause Frame to stop transfers from flooding receive buffers and having packets dropped. This facility is only used if the opposite end cooperates, and the only mechanism to advertise that cooperation is autonegotiation.

I've seen an implementation of Microsoft Exchange 2010 brought to its knees for lack of Pause Frames, and it is again an insidious failure, since packets are only dropped under load while ping tests and even high-load throughput tests succeed. It is clinging to an old wisdom without knowing its cause, and then failing to keep up with developments, that causes this issue. Not running with autonegotiation means you aren't running a standards-compliant Gigabit Ethernet network, and all promises are void.

Not following vendor advice is a bad idea. If the vendor promises a feature that you feel is not ready for primetime then by all means hold off. But if I expect something to work that a vendor promises will work, I don't expect to be told war stories of how this breaks - especially when I last saw that issue, myself, over 12 years ago. It's old thinking, stuck in past fears, and it's stopping you from unleashing your platform's potential. Windows Server especially has become a solid, dependable and performant platform, yet doubts linger and fears cling to dark corners, an uneasiness that is sometimes not even apparent to those harbouring it.

I enjoy reading on the history of computing, and contemplating how modern computers implement both Harvard and von-Neumann architectures depending on how closely you're looking. It's esoteric to speak of privilege rings or context switches, but knowing these things has been of immense help to round out my understanding of computing and gain trust in the models deployed. But the biggest thing I would like to see engineers embrace is this:

The Turing Machine

It's a simplistic representation of any computer, from your old calculator wristwatch to a supercomputing cluster: a processor reads instructions from a sequence, carries out those instructions against some data, and stores the result somewhere before moving to the next instruction. The next instruction may be a pointer to a different instruction, but all of computing boils down to this concept. There may be more than one processor, and there may be complex layouts of memory, but at its most basic every computer works this way, and building your model of a system's internals starts here.
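
If you want to see how little machinery that description needs, here is a toy sketch in Python (the state names and the transition table are mine, purely for illustration): a machine that adds one to a binary number, one cell and one rule at a time.

def run(tape, head, state, rules, blank="_"):
    """Step the machine until it reaches the 'halt' state."""
    cells = dict(enumerate(tape))            # a sparse, effectively endless tape
    while state != "halt":
        symbol = cells.get(head, blank)      # read the cell under the head
        write, move, state = rules[(state, symbol)]
        cells[head] = write                  # write, then move one cell
        head += 1 if move == "R" else -1
    lo, hi = min(cells), max(cells)
    return "".join(cells.get(i, blank) for i in range(lo, hi + 1))

# Increment a binary number: turn trailing 1s into 0s, then the first 0
# (or blank) into a 1. Start at the least-significant bit.
rules = {
    ("inc", "1"): ("0", "L", "inc"),
    ("inc", "0"): ("1", "L", "halt"),
    ("inc", "_"): ("1", "L", "halt"),
}

print(run("1011", head=3, state="inc", rules=rules))   # prints 1100

Every step is a pure function of the current state and the symbol under the head, which is exactly the property the next paragraph leans on.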

It is deterministic, in that the state after an instruction is performed can be predicted from the initial state. In principle all of computing conforms to this, and any unexpected behaviour simply means the initial state was not understood well enough. It is this mountain that engineers need to climb to truly excel at their profession, and I've met some expert climbers in my time. They have no fear of digging down to each root cause, and unearthing an even deeper root beneath it.

Rebooting is not the answer. It indicates a lack of knowledge of the cause of faults. It is a sign of an unwillingness to investigate further. Worst of all, it is a misunderstanding of what your server is and what it is meant to do, and the longer you allow that mentality to perpetuate, the worse off you will be.

Old tales have value, but they are no substitute for knowledge and verifiable fact. If those facts contradict your experience, investigate, shout at vendors, check your implementation.

But most of all, be proud of your platform, because as obscure as it appears to be, it is genuinely not that hard if you are willing to do better.

Previous: Part 6. It's OK, the Resilient Partner Can Take Over

Thursday 17 October 2013

WYCRMS Part 6. It's OK, the Resilient Partner Can Take Over

In 1997, an HP 9000 engineer wouldn't blink when telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

6. It's OK, It's One of a Resilient Pair

No, it's not.

So I have two Active Directory Domain Controllers. They both run DHCP with non-overlapping scopes, they have a fully replicated database, and any client gets both servers in a DNS query, so load-balancing is pretty good. Only they're not a single service. They may be presented to users as such, but they are distinct servers running distinct instances of software that, apart from sending each other directory updates, don't cooperate. A user locates AD services using DNS, and simply rebooting a server leaves those DNS entries intact. You've now intentionally degraded your service (only half of your nodes are active) without giving your clients a heads-up.
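
For the curious, this is roughly what that lookup looks like from the client's side - a sketch using the third-party dnspython package against a hypothetical domain (the real DC locator does more, but the SRV records are the heart of it):

import dns.resolver

# Active Directory publishes its domain controllers as SRV records.
answers = dns.resolver.resolve("_ldap._tcp.dc._msdcs.corp.example.com", "SRV")
for srv in answers:
    print(f"{srv.target}  port {srv.port}  "
          f"priority {srv.priority}  weight {srv.weight}")

Reboot one DC and its record is still in that answer, so clients will happily keep being pointed at a server that isn't there.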

Sure, DNS timeouts will direct them to the surviving node eventually but why would you intentionally degrade anything when it's avoidable? It's only thirty seconds you say? Why is this tolerable to you?

Failover cluster systems are also not exempt. One of the benefits of these clusters is that a workload can be moved to (proactively) or recovered at (after a failure) another node. Only failover clustering is shared-nothing, so an entire service has to be taken offline before it is started on the other node. Again this involves an outage, and as much as Microsoft have taken pains to make startup and shutdown of e.g. SQL Server much quicker than in the past, other vendors are likely not as forgiving. It's astonishing how quickly unknown systems come to rely on the one you thought was an island, and how suddenly they turn out not to handle interruptions.

Only in the case of an actively load-balanced cluster can taking one node down be said to be truly interruption-free. When clients request service, the list of nodes returned does not contain stale entries. When user sessions are important, the alternate node(s) can take over and continue to serve the client without a blip or a login prompt. In case you're confused, this is still no reason to shut down a server anyway - refer to the previous five articles if you're leaning that way - and if you're thinking of leaving a single node to keep on churning through the load, then you haven't quite grasped what resilience is there for.

The point of a resilient pair is that it is supposed to survive outages, not serve as a convenient tool for admins to perform disruptive tasks while hoping users won't notice. There's a similar tendency for people to use DR capacity for testing, without considering whether the benefits of that testing are truly greater than the cost of reducing or eliminating DR capacity.

Application presentation clusters (e.g. Citrix XenApp) are a favourite target for reboots, and this is the most often-cited area where reboots are claimed to be best practice. And here it is: the only vendor-published document for current software I have found in the last five years advocating a scheduled reboot - Citrix's XenDesktop and XenApp Best Practices Guide, page 59. It is poor, to say the least:

A rolling reboot schedule should be implemented for the XenApp Servers so that potential application memory leaks can be addressed and changes made to the provisioned XenApp servers can be reset. The period between reboots will vary according to the characteristics of the application set and the user base of each worker group. In general, a weekly reboot schedule provides a good starting point.

More imprecise advice is hard to find in technical documents. How exactly does the administrator, engineer or designer know the level of his "potential" exposure to memory leaks? I've spent some time exploring this issue in the previous articles, and I stand by my point - if an administrator tolerates poor behaviour by applications or - worse - the OS itself without making an attempt to actually correct the flaw (e.g. contacting the vendor to demand a quality product), that administrator is negligent, scheduled reboots are a workaround, and nobody can have a reasonable expectation of quality service from that platform.

But most of all: How are you ever going to trust a vendor who has so little faith in their product that it cannot tolerate simply operating? I'm not singling out Citrix here, but their complacency in the face of bad code is shocking. I admire Citrix, so I'm not pleased at this display of indifference. Best practice I guess...

Then we get to sentences two and three of this three-sentence paragraph, which tell our reboot-happy administrator to try a particular schedule without a definitive measure in sight. There's a link on how to set up a schedule and how to minimise interruption while it happens, but not one metric - or even a place to find one - is proposed. He or she is given a vague "meh, a week?" suggestion with zero justification, apart from it being "feels-right"-ish.

If a server fails, it is for a specific reason. Sometimes this is buried deep in kernel code, the result of interactions that can never be meaningfully replicated, or something far more exotic. In most cases, however, there is a precise cause (memory leaks included), and computing is honestly not so hard that these cannot be fixed.

You might be able to tell I'm an open-source advocate, because I firmly believe in reporting bugs. I also believe in getting to see the response to that bug. I've found some projects to be more responsive than others, but generally, if I've found something that is not just broken but damaging, I see people hopping to attention - and those are people volunteering their time.

If you're buying your software from a vendor, that is the floor their response to you should start from. Tolerate nothing less than attention, and get your evidence together before they start pointing fingers.

When you work in a large organisation you realise things have designations and labels for a reason. Resilient pairs are for unanticipated failures, and DR servers are for disasters.

You don't get to hijack a purpose just because it's unlikely it will be needed - they exist precisely for the unlikely.

Previous: WYCRMS Part 5. Nobody Ever Runs a Server That Long

Monday 14 October 2013

WYCRMS Part 5. Nobody Ever Runs a Server That Long

In 1997, an HP 9000 engineer wouldn't blink when telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

5. Nobody Ever Runs a Server That Long

Uptime

IT Engineers can take this stuff really seriously. I was quite proud of my own server at home that ran, uninterrupted, for one year and three months, during which time I upgraded the RAID array without any filesystem disruption, hosted at least twenty different types of VM (from Windows 3.11 to Windows 2008), served media files and MySQL databases, shared photos with Apache and personal files with Samba. Only the fast-moving BTRFS in-kernel driver broke me from that little obsession, but you don't need to run a Unix-variant to get that kind of availability.

Windows admins are simply used to "bouncing" computers to correct a fault. Hey, it works at home right? It's a complacency, a quick-fix, and often in response to an urgent need to restore service - troubleshooting can and often does take a back seat when management is standing over your shoulder.

Ever since Windows was a command launched at a DOS prompt, the restart has been a panacea - and often a very real one that unfortunately actually works. It almost never adds any insight into the fault's cause. Perhaps there's a locked file you're not aware of preventing a service from restarting, or a routing entry updated somewhere in the past that isn't reapplied when the system starts up again; there are myriad ways a freshly started server can have a different configuration from the one you shut down, allowing service to resume.

Once a server is built, it is configured with applications and data, then generally tested (implicitly or explicitly) for fitness of purpose. Once that is done, it processes a workload and needs no more attention, assuming all tasks such as backups, defragmentation and so on are scheduled and succeed. Windows isn't exactly a finite-state machine (in the practical sense), but it is nonetheless a closed system that can only perform a limited set of tasks, and its failure modes should be easy to predict.

Servers are passive things. They serve, and only perform actions when commanded to. Insert the OS installation DVD, run a standard installation, plug in a network cable, and the system is ready for action. Only it's not configured to take action just yet - it's waiting. In this state, I think most engineers would expect it to keep waiting for quite some time - perhaps a year or more. But add a workload - say a database instance or a website, and attitudes change.

I've had frequent discussions with engineers who will tell me things like "this server has been up for over a year, it's time to reboot it". Somewhere between an empty server and a billion-row OLTP webshop is a place where the server goes from idle and calm to something that genuinely scares engineers - just for running a certain amount of time.

When pressed for exactly which component is broken (or likely to break) that this mystical reboot is supposed to fix, I never get anything specific, just a vague "it's best practice".

Windows Updates are frequently cited as a reason to reboot servers, and thanks to the specifics of how Windows implements file locking, yes, the reboot there is unavoidable. This leads to the unfortunate tendency to accept reboots as a normal part of Windows Server operation, and even to see the reboot as the point (with an update thrown in since the server is rebooting anyway) instead of an unfortunate side-effect. I realise the need to keep servers patched, but again, when pressed for a description of which known defects (that actually have affected or plausibly could affect service) a particular update - with its associated downtime - will fix, the response comes in: "Um, best practice?".

In the absence of an actual security threat, known defect fix or imminent power failure, I am rarely convinced to shut a server down. I first included "vendor recommendation" in that list, but realised I've yet to see one. Ever.

Even at three in the morning when no sane customer could be relying on a system, during a once-quarterly change window when all services are nominated unavailable so service providers can make radical changes, even then: No, you can't reboot my server.

If engineers took the time to think about where in the continuum from empty server to complex beast the point of fear arrives, they could figure out which bit is scaring them and make sure those parts are well understood, properly configured and maintained. Unfortunately, that takes time, effort and sometimes a bit of theory and modelling. Rebooting is so much less effort.

Windows Server can, and should, be expected to remain ready for service for as long as the hardware lasts. With the advent of virtualisation and vMotion, even that obstacle is gone, and the limits are practically nowhere to be found. Applications are another story, and if the developer or support specialist thinks they need restarting, that's fine, but they have zero authority to suggest the same for Windows.

I've heard the phrase "excessive uptime" identified as the root cause of outages. I doubt Microsoft would like to know that the engineers they certify are - and I don't say this lightly - genuinely afraid of their product doing its job, as designed, for years. And when a genuine OS fault does occur that only a reboot can solve, it is quite shocking how few engineers will actually report the problem to the vendor, tolerating workarounds, design hacks and kludgy scripts instead.

In the same way that one can learn the procedure for changing the spark plugs on a specific model of engine while completely missing the black smoke pouring from the exhaust thanks to a chronic ignition timing failure, so too can engineers who have not attained even a mediocre grasp of computing theory continue to diagnose and treat only symptoms.

A server failing to continue to do its core function of staying up is not a mild symptom that a reboot can fix. It is a fundamental failure of the product, and failing to do the hard thing of actually understanding why and demanding the vendor improves their product does nobody any service.

In fact, it's negligent.

Previous: Part 4. Windows Updates and File Locking

Thursday 10 October 2013

WYCRMS Part 4. Windows Updates and File Locking

In 1997, an HP 9000 engineer wouldn't blink when telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

4. Windows Updates

It's Lab Time!

Open a console on Windows (7, 2008, whatever), and enter the following as a single line (unfortunately wrapped here, but it is a single line):

for /L %a in (1,1,30) do @(echo %a & waitfor /t 1 signal 2>NUL) >numbers

What's that all about, you ask? Well, it sets up a loop with a counter (%a) that increases from 1 to 30. On each iteration, the value of %a is echoed to the standard output, and WAITFOR (originally a Resource Kit tool, included in Windows 7/2008 R2) pauses for one second waiting for a particular signal. 2>NUL just means I'm not interested that the signal never arrives and want to throw away the error message (there's no simple sleep command available to the console that I can find in Windows). Finally, >numbers sends the output to the file numbers.

As soon as you press enter, the value 1 is written to the first line of a file called numbers. One second later, that line is overwritten with the value 2, and so on for thirty seconds.

Now open another console in the same directory while the first is running (you've got thirty seconds, go!) and type (note: type is the name of a command you enter here, not an instruction for you to start typing):

type numbers & del numbers

If the first command (the for loop) is still running, you'll get the contents of the file (a number), followed by an error when the delete is attempted - this makes sense, as the first loop still has the file open for writing.

This demonstrates a very basic feature of Windows called File Locking. Put simply, it's a traffic cop for files and directories. The first loop opens a file and writes the first value, then the second, then the third, all while holding a lock on the file. This is a message to the kernel that nobody else is allowed to alter the file (deletes, moves, modifications) until the lock is released, which happens when the first program terminates or explicitly releases the file.

This is great for applications (think editing a Word document that someone else wants to modify), but when it comes time to apply updates or patches to the operating system or applications, it can make things very complex. As an example, I have come across a bug in Windows TCP connection tracking that is fixed by a newer version of tcpip.sys, the file that provides Windows with an IP stack. Unfortunately, if Windows is running, tcpip.sys is in use (even if you disable every NIC), so as long as this file is being used by the kernel (always) it can never be overwritten. The only time to do this is when the kernel is not running - but then how do you process a file operation (performed by the kernel) when the kernel is not available?

Windows has a separate state it can enter before starting up completely where it processes pending operations. Essentially, when the update package notices it needs to update a file that is in use it tells Windows that there is a newer version waiting, and Windows places this in a queue. When starting up, if there are any entries in this queue, the kernel executes them first. If these impact the kernel itself (e.g. a new ntfs.sys), the system performs another reboot to allow the new version to be used.
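
For the mechanically minded, that queue is reachable through an ordinary Win32 call. The sketch below (Python via ctypes, with a hypothetical path, and it needs administrative rights) schedules a locked file for deletion at the next boot - the same mechanism, recorded in the PendingFileRenameOperations value under HKLM\SYSTEM\CurrentControlSet\Control\Session Manager, that update packages use for their renames:

import ctypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
MOVEFILE_DELAY_UNTIL_REBOOT = 0x4        # documented MoveFileEx flag

locked_file = r"C:\Temp\in-use.dll"      # hypothetical file currently held open

# Passing None as the new name asks the kernel to delete the file at the
# next boot rather than now; a rename works the same way with a real target.
if not kernel32.MoveFileExW(locked_file, None, MOVEFILE_DELAY_UNTIL_REBOOT):
    raise ctypes.WinError(ctypes.get_last_error())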

This is the only time a reboot is necessary for file updates. Very often administrators simply forget to do simple things like shutting down IIS, SQL or any number of other services before applying a patch for those components. A SQL Server hotfix is unlikely to contain fixes for kernel components, so simply shutting down all SQL instances before running the update will remove the reboot requirement entirely.

Similarly, Internet Explorer is very often left running after completing the download of updates, some of which may apply to Internet Explorer itself. Even though it is not a system component, the file is in use and so the update is scheduled for action at reboot. Logging in with a stripped-down, administratively privileged account to execute updates removes the possibility that taskbar icons, IE, an Explorer right-click extension or anything else is running that might impede a smooth, rebootless deployment of patches that touch interactive user components.

This is simply a function of the way Windows handles file locking, and a bit of planning to ensure no conflicts arise can remove unnecessary reboots in a lot of cases.

Previous: Part 3. Console Applications, Java, Batch Files and Other Red Herrings

Wednesday 9 October 2013

WYCRMS Part 3. Console Applications, Java, Batch Files and Other Red Herrings

In 1997, an HP 9000 engineer wouldn't blink when telling you about a server that had been running continuously for over five years. I found this remarkable at the time, and couldn't imagine a Windows server lasting that long. I have moved on, and frankly expect my Windows servers to survive that long today. Very few share this position, and I'm trying to find out why it's so lonely on this side of the fence.

3. Console Applications, Java, Batch Files and Other Red Herrings

Not to start this off on a downer, but I need to let you know I'll be insulting a few types of people in this post. I also need to make one thing extra-clear: I hate Java.

The idea is great: write code once and run it on any platform without recompilation. Apart from the hideously long time it took for Java to come to 64-bit Linux, platform support is pretty good too. Sun (then Oracle) have been responsive in fixing bugs, but because Java is a platform it is somewhat more difficult to roll out updates, so huge numbers of obsolete JRE deployments are left around for nefarious types and buggy software to run amok on. The reason I hate it is twofold: it allows developers to be lazy about memory management and rely on automatic garbage collection, and almost every application I've come across (except Apache Tomcat-based WARs) explicitly constrains support to certain platforms. This is not what I was promised.

When someone talks about "Windows running out of handles", "memory leaks", "stale buffers" or any number of technical-sounding pseudo-buzzphrases, they are almost always trying to describe a software malfunction that appears as a Windows failure, or they are simply too lazy to investigate and realise it is almost invariably caused by lazy programming. Java does this, but I don't blame Java, I blame Java programmers. The opinion is rife that Windows gets less stable the longer Java applications run and that reboots are a Good Thing™. If someone genuinely believes that server stability can be impacted by poor software, but won't report it to the vendor, I will inform that person that he or she is lazy.

As I mentioned in Part 1, Windows engineers seem to scale their experience of Windows at home up to their professional roles, and I've seen developers do the same. Windows doesn't do pipes very well, or they are language- or IDE-specific. Outputting to the Event Log is slightly arcane and in fact requires compiling a message DLL to make most output meaningful; it's rarely used outside Microsoft themselves. So developers rely on consoles for the display of meaningful output.
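
To be fair to those developers, here is roughly what the Event Log route looks like from code - a sketch in Python using the third-party pywin32 package, with a hypothetical source name. Without a registered message DLL the entry arrives wrapped in a "description for this event cannot be found" warning, which is exactly the hurdle I mean.

import win32evtlog
import win32evtlogutil

SOURCE = "MyLineOfBusinessApp"   # hypothetical; registering it properly
                                 # needs admin rights and a message DLL

win32evtlogutil.ReportEvent(
    SOURCE,
    1000,                                            # event ID
    eventType=win32evtlog.EVENTLOG_INFORMATION_TYPE,
    strings=["Nightly import finished", "42 rows processed"],
)

Faced with that extra step, most developers shrug and print to a console instead.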

These consoles then become part of the deployment practice, perhaps wrapped in a batch file. If your program relies on a console window (and therefore a logged-in, interactive user session), or worse, requires me to edit a batch file to apply configuration changes (as opposed to, say, a settings file parsed by said batch file), your software is nowhere near mature enough to be deployed on a system I would expect people to depend on. As a programmer, I question your maturity too.

It's people and organisations like that who typically have one response to issues that crop up: install the latest Service Pack, maybe the latest Windows Updates too (that fixes everything, right?), and if all else fails upgrade to the Latest Version of our software - don't worry that it's got a slew of new features that likely have bugs too, they'll be fixed in the next Latest Version. Rinse, repeat.

As a Windows Engineer, your job is to defend the platform from all attackers. That's not just bad folks out there trying to steal credit card numbers and use you as a spam bot, it's also bad-faith actors trying to deflect the blame from their own inadequacy. It's application owners prepared to throw your platform under the bus to hide their poor procurement and evaluation standards. It's users who saw a benefit in a reboot once and think it's a panacea.

It is in everyone's interest to call people out when they fail to deal with this stuff properly, or you'll quickly find yourself supporting a collection of workarounds instead of a server platform.

Previous: Part 2. Windows Just Isn't That Stable