White-Coat Computer Science

Those who test products and changes before rolling them into production stand a higher chance of continued employment. Use that technical version of Darwin’s natural selection to your advantage.

Since the release of Windows NT Service Pack 4, the engineering team had been hard at work making sure the new product would be compatible with all the current hardware standards and mission-critical applications in use throughout the enterprise. After months of compatibility testing in the lab, the project was finally passed over to the software distribution team for global deployment. Microsoft Systems Management Server pushed the job out to the workstations with little problem; it was now time to focus on the servers. The site in Bangor, Maine was the first to deploy—and thus the first to witness the “blue screen” boot failures on some of the older Compaq servers.

Another Methodical Approach
For another perspective on a methodical approach to your work, read Lee Christopher Grant’s exclusively online article “Survive Chaos.”

Efforts that succeed in the controlled safety of the lab often don’t yield the same results when we apply them to a production environment. Despite our best efforts to look at the task at hand from every angle, we run into problems that keep us up all night racking our brains about where we diverged from the beaten path. When the problem is finally solved, we often find either that it was caused by some incompatibility that was well known (except to us) or that the lab configuration didn’t accurately reflect our production environment.

Many of the horror stories we hear regarding production environment failures during deployments come not from a lack of knowledge or skill, but from some divergence from what was expected. Microsoft’s claim that Service Pack 4 (SP4) was a simple upgrade shouldn’t have freed you from having to test the product in your environment. You may have applied SP4 successfully to your desktop, but when you applied it to the file server hosting all of the executives’ home directories, you witnessed the blue screen of death. After standing in the data center scratching what remains of your hair for the balance of the night and trying everything under the sun short of voodoo, you receive the dreaded call. It’s the director of IT, asking, “Why can’t I access my home directory?” You don’t really want to tell him that you never tested it on this hardware platform, do you?

For those who don white lab coats for a living, existence is dependent not upon work done in the lab, but on the ability to repeat experiments successfully on demand. Successful scientists maintain pristine laboratories and document every step of every process they perform to assure that their results will be repeatable if success is attained. If a scientist believes she found the cure for cancer, wouldn’t it be a shame if the results were unrepeatable? Did she really find the cure if she can’t repeat the findings of the experiment?

When we explain to our peers, customers, and bosses that a procedure worked in the lab but doesn’t work as planned in the production environment, our credibility is put at stake. As technologists, we’re typically a financial liability to an organization, unless we work for a contracting firm whose business is to sell our services. We rarely make any money for the organization, but instead we must justify our existence within the enterprise for the value our work adds to existing business processes. We build solutions that enable business users to do their work more efficiently, allowing them to spend more time on the profit-generating business processes rather than on the tools needed for the job.

Avoiding TechnoDarwinism

If you prefer to fly by the seat of your pants rather than apply some basic scientific principles to your work, Darwin’s theory of natural selection will work against you within your organization. Quite simply, those who test products and changes before rolling them into production stand a higher chance of continued employment. Conversely, those who take their chances by deploying an untested product in a production environment quickly fall victim to that same selection. These are the individuals often “selected” to leave the organization after failing to grasp the importance of applying scientific principles to their work.

In any well-devised deployment plan, there should always be time reserved for research and testing. But when things run late, lab time is usually the first item to get cut. Most project managers seem to think that the week of testing you entered on your deployment project plan is merely a code word for the extra time added to every project plan to accommodate our inability to accurately predict the unknown. They immediately target this seemingly bogus entry for deletion or reduction.

Inevitably, once you move your project from the development domain into production, a host of unforeseen circumstances keeps you from seeing daylight for the next few days. This prevents the project from completing anywhere near the milestone set by the project manager, raising questions as to whether or not it was truly worth it to cut out that week of pre-production testing.

All too often, the work we do is so new or unique that we can’t accurately estimate the time we’ll need or the obstacles we’ll encounter along the way. Did the NASA scientists accurately estimate the time or money required to put the first man on the moon? The moon landing proved to be an event that NASA would repeat, and inevitably, the knowledge gained from the first mission would benefit the time and resource estimates for subsequent missions. Armed with a bit of knowledge learned from our own lab experiments, we too can begin to benefit from our previous experiences.

For systems administrators, there’s often little reason why we can’t practice in a non-critical environment to prepare ourselves for the pitfalls that may lie ahead in the upgrade. That’s not to say that every upgrade, migration, and deployment will go smoothly if we practice it once or twice in the lab; there will always be unforeseeable problems. But generally speaking, significant amounts of practice beforehand will yield a better success ratio for our efforts than if we just give it a try and see what transpires.

The time to research incompatibility issues, test changes to the environment, and devise disaster plans isn’t after the event occurs, but long before. If you work in an environment where you feel you should be donning a fire helmet most days, you’re already familiar with the dangers of avoiding a proactive approach to problem solving. Those who are constantly in a reactive state have no time to prepare technologies that will increase competitive advantages for the enterprise. Considering the increasing role of technology in today’s super-competitive market, even entire organizations can easily fall victim to the selective nature of TechnoDarwinism.

A Few Guidelines

To help ensure that efforts in the lab are indeed useful, consider the following guidelines.

Standardize the User Environment

Too many enterprises lack strict standards for the user environment. Instead, they let machines exist with varying directory structures, office automation suites, hardware platforms, and even operating systems. Because we’re generally financial liabilities to most organizations, we must find ways to reduce the cost of supporting machines in the environment to justify our continued existence. If each machine is different, there’s no way to benefit from the economies of scale that we’d enjoy in large enterprise environments. While a discussion on the importance of enterprise standards is well outside the scope of this article, organizations that lack a strict policy on hardware and software standards are destined to drive IT support costs significantly higher than truly necessary. Without a normalized environment, we can’t reliably predict whether the results derived in the lab can be re-created in a production environment.

Research Known Incompatibilities Before Trying to Change Production Environments

The inability of certain Compaq servers to boot Windows NT successfully after installation of Service Pack 4 is well documented on Compaq’s Web site, but we most likely didn’t find that out until after the blue screen appeared. All too often, bonus-protecting managers insist that a deployment be done by some arbitrary date, leaving us with little time to perform the required testing or research. A simple visit to Compaq’s Web site could have spared us hours of downtime (the very thing that kills the manager’s bonus) and kept us from having to answer the dreaded queries from senior management about how this could have happened.

By visiting the Compaq site before the upgrade, we would have learned that there’s a known incompatibility between firmware v.1.36 and below on SMART/2P and SMART/2E array controllers and Microsoft Windows NT Service Pack 4. Armed with such knowledge, we could have applied SSD 2.08 (as per the guidance of the Customer Advisory) while we had the scheduled downtime. Had we taken a single proactive step to gather more information regarding the task at hand, the SP4 installation on the server might have succeeded.
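A check like this can be written down once the advisory has been read, so that it runs against every server before the service pack does. The following is a minimal sketch, not Compaq’s or Microsoft’s tooling: the controller models and the 1.36 firmware ceiling come from the advisory described above, while the inventory format, the server names, and the assumption that firmware revisions are plain "major.minor" strings are hypothetical.

```python
from dataclasses import dataclass

# Taken from the advisory described in the text.
AFFECTED_MODELS = {"SMART/2P", "SMART/2E"}
MAX_AFFECTED_FIRMWARE = (1, 36)  # firmware 1.36 and below is affected


@dataclass(frozen=True)
class ArrayController:
    """One row of a hand-maintained hardware inventory (hypothetical format)."""
    server: str
    model: str
    firmware: str  # assumed to look like "1.36"


def parse_firmware(text: str) -> tuple[int, ...]:
    """Turn "1.36" into (1, 36) so revisions compare as numbers, not strings."""
    return tuple(int(part) for part in text.split("."))


def needs_update_before_sp4(controller: ArrayController) -> bool:
    """True if the controller falls inside the documented incompatibility."""
    return (
        controller.model in AFFECTED_MODELS
        and parse_firmware(controller.firmware) <= MAX_AFFECTED_FIRMWARE
    )


def blocked_servers(inventory: list[ArrayController]) -> list[str]:
    """Servers that should not receive SP4 until their firmware is updated."""
    return sorted({c.server for c in inventory if needs_update_before_sp4(c)})


if __name__ == "__main__":
    # Hypothetical inventory; the names and revisions are illustrative only.
    inventory = [
        ArrayController("FS-EXAMPLE-01", "SMART/2P", "1.36"),
        ArrayController("FS-EXAMPLE-02", "SMART/2E", "1.40"),
        ArrayController("FS-EXAMPLE-03", "OTHER-CONTROLLER", "1.10"),
    ]
    for name in blocked_servers(inventory):
        print(f"{name}: update array controller firmware before applying SP4")
```

The value of such a check lies less in the code than in the inventory behind it: it only works if someone has recorded which controller and firmware revision each server actually carries.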

Document All Procedures Performed in the Lab Environment

The most important way to increase the repeatability of your work in the lab is to make sure you document every step of the process, no matter how trivial it may seem. Our notes must be so detailed that a third party can easily re-create our work without our involvement.

It’s also essential that you have a peer (or a QA group, if your organization has one) review your documentation. As authors, we have a tendency to make assumptions that we may not clearly document in the text.

Create Identical Lab and Production Environments

If we hope to gain any useful data from our lab experiments, the lab must closely resemble the production environment for the task at hand. For example, if we want to simulate the interaction of an application across domain trusts, we must first establish an environment similar to what we have in production. While it’d be ideal to match every aspect of the production environment in the lab, this is often cost-prohibitive. Instead, we may be able to stand in for the 10 servers making up the domain architecture with decommissioned desktops and servers, and so simulate the interaction of our product in a multi-domain environment. The same is true for testing driver updates, hot fixes, and other system-level software changes. This includes making sure that the firmware revisions, drivers, card locations, memory, processor count, etc. in the lab equipment match what’s being used in production.
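One way to keep the lab honest is to record the same attributes for a lab machine and its production counterpart and compare them before each test. The sketch below assumes the attributes have already been collected by hand into simple key/value records; the attribute names and values are hypothetical, and gathering them from real hardware is outside its scope.

```python
from typing import Optional

# attribute name -> (production value, lab value); None means "not recorded"
Drift = dict[str, tuple[Optional[str], Optional[str]]]


def config_drift(production: dict[str, str], lab: dict[str, str]) -> Drift:
    """Return every attribute whose value differs between production and lab."""
    drift: Drift = {}
    for key in sorted(production.keys() | lab.keys()):
        prod_value = production.get(key)
        lab_value = lab.get(key)
        if prod_value != lab_value:
            drift[key] = (prod_value, lab_value)
    return drift


if __name__ == "__main__":
    # Hypothetical records for one production server and its lab stand-in.
    production = {
        "array_controller_firmware": "1.36",
        "nic_driver": "4.1",
        "memory_mb": "512",
        "processor_count": "2",
    }
    lab = {
        "array_controller_firmware": "1.40",
        "nic_driver": "4.1",
        "memory_mb": "256",
        "processor_count": "2",
    }
    for name, (prod_value, lab_value) in config_drift(production, lab).items():
        print(f"{name}: production={prod_value} lab={lab_value}")
```

An empty result doesn’t prove the lab is identical to production, only that the attributes someone thought to record agree. Every surprise in production is a candidate for a new attribute on the list.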

Each application installed on a machine wants to install its own DLLs in the system directory, and perhaps the latest version of MDAC installed with Office 2000 may just break the critical database application the primary user runs each day. Without significant testing in a lab that mirrors your standardized production environment, you can’t provide any assurances (beyond mere guesswork) to those who count on you that your efforts will be truly successful.

Use Scripting Methods to Improve Repeatability of Results

One of the best ways to make sure you can repeat complex operations is to write a script to perform the upgrade. Once the script runs the way you want it to, it can be easily run in the production environment to duplicate your efforts exactly. This is especially useful when trying to apply complex NTFS permissions, create users or groups, or modify the Registry. Scripts also help assure that the environment has been initialized to a known state for each test we perform, which is essential for garnering valid data from our experiments.
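The property that makes a script safe to carry from the lab into production is that each step checks the current state before changing it, so a second run changes nothing and the environment ends in the same known state either way. The sketch below illustrates that pattern only: a plain dictionary stands in for the Registry or a permissions store, and the `Step`, `set_value`, and `run` names and the sample keys are hypothetical, not part of any Windows API.

```python
from dataclasses import dataclass
from typing import Callable

State = dict[str, str]  # stand-in for the real system being configured


@dataclass(frozen=True)
class Step:
    """One documented action: how to tell it is done, and how to do it."""
    name: str
    is_done: Callable[[State], bool]
    apply: Callable[[State], None]


def set_value(key: str, value: str) -> Step:
    """Build a step that forces one setting to a known value."""
    def is_done(state: State) -> bool:
        return state.get(key) == value

    def apply(state: State) -> None:
        state[key] = value

    return Step(f"set {key}={value}", is_done, apply)


def run(steps: list[Step], state: State) -> list[tuple[str, str]]:
    """Apply each step only if needed and return a log of what happened."""
    log: list[tuple[str, str]] = []
    for step in steps:
        if step.is_done(state):
            log.append(("skipped", step.name))
            continue
        step.apply(state)
        if not step.is_done(state):
            # Stop at the first step that fails to take effect.
            raise RuntimeError(f"step did not take effect: {step.name}")
        log.append(("applied", step.name))
    return log


if __name__ == "__main__":
    # Hypothetical settings; the same step list would be run in lab and production.
    steps = [set_value("HomeDirRoot", "D:\\Users"), set_value("AuditLevel", "2")]
    machine: State = {"AuditLevel": "0"}
    for action, name in run(steps, machine):
        print(action, name)
```

The returned log doubles as documentation of what the script actually did on each machine, which ties the scripting guideline back to the one on documenting every step.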

Using the Active Directory Service Interfaces (ADSI) with our favorite programming language, we can perform almost any Windows NT, Windows 2000, Exchange, IIS, or Novell administrative function programmatically. This can be useful not only for developing scripts that will re-create our lab actions in a production environment but also for the reverse: with Visual Basic and ADSI, we can create powerful scripts that re-create the production user domain SAM in our lab environment.

Avoiding Extinction

To help increase your chances of success for implementing new changes in your production environment, here are some steps to follow:

  • If you’re operating in a non-standardized environment, seize the opportunity to implement standards when performing a major upgrade to the enterprise (such as Windows 2000).
  • Research potential known incompatibilities for the software or hardware you’re about to install.
  • Re-create the elements of the production environment that will be affected by your changes in a non-critical environment or isolated network.
  • Document your experiences and lab procedures with meticulous detail.
  • Script procedures in the lab environment where possible to guarantee the same procedure will be followed when it’s moved to production. Whether it’s being used to initialize the environment during the testing or to perform the actual task at hand, scripting can help assure consistent results.
  • Test the impact of a new application or system update with all critical applications. Simply logging into the client isn’t an adequate test for most deployments.
  • Have a third party validate your documentation to make sure it can be reproduced without your intervention.

The next time you avert a major system outage because you found the problem and resolution before the change was implemented in a production environment, raise a glass to the parents of scientific thought for their contribution to your success.
