
Islands of Storage: Keep It Safe with a Disaster Recovery and Business Continuity Plan

If your thoughts are occupied by storage and how to keep it all secure and available, maybe you should begin thinking like a DRBC architect.

Enterprise systems administration is one of those jobs: when things go right, you're glad that you don't have a lot of fires to put out that day. But then there are the days when the fires blaze hot and heavy all around you, and the day flies by without you getting any work done at all...except for triage.

Those days are bad, but hopefully, the worst day of all will never arrive. That day when the entire building disappears. Poof! Now for my next magical trick, I shall make a large-scale computing enterprise vanish before your eyes! So, what if the other shoe really does drop? What if that fateful day actually occurs, and you're left standing on the edge of a precipice, looking at what used to be the building you walked into every morning? But wait, it gets worse. What if, while you're looking into that smoking crater, the CEO is standing right behind you, demanding: “You mean to tell me that you don't have a back-up plan? That your so-called back-up plan is sitting in the bottom of that...that hole? That we don't have any systems that we can fall back on?”

Like a scene out of It's a Wonderful Life, you can almost hear the CEO saying, “You realize what this means? It means disaster, and scandal and bankruptcy, you ignorant fool!” While he's wringing your neck, of course.

This, my friends, is the stuff of disaster recovery and business continuity. You think about the worst possible situation, and then devise a way to recover from it so that instead of being choked to death you can calmly say this: “Well chief, it's bad that you lost your shiny building. But hey, the good news is I've got us covered. I developed Plan B, implemented it, tested it, and now we put it into place. Follow me, and prepare to be amazed.”

Let's start by taking a look at a corporate environment called 2Cool.USA, a fictitious 3-D game software manufacturer, representative of the kind of enterprise one would expect to find in today’s networked environments. Figure 1 shows 2Cool.USA's three campuses, each consisting of a couple thousand users, each with different connections to the same ISP or (we hope, for security purposes) entirely separate ISPs. Additionally, the mid-tier storage arrays at each campus come from different manufacturers. Denver has a NetApp device, San Francisco an EMC CLARiiON, and Atlanta a Sun StorEdge—all high-end boxes.

Islands of Storage
Figure 1. A typical three-campus network environment. (Click image to view larger version.)

The three campuses are connected together by some sort of WAN circuit. In today’s Internet economy, these connections could easily be VPN links, but the point is that the campuses are connected together separately from their Internet linkage.

It’s the Geography
You don’t have to be Bill Gates to figure out that there are probably at least three different admin groups involved in the daily operations of this organization. Why? Because of the geographic separation of the campuses. Let’s say that in the case of 2Cool, San Francisco campus personnel are primarily involved with coding the actual games, while Denver consists of the HR and Financial folks and Atlanta houses Marketing, Sales and Provisioning. Or maybe there’s some mixture of operational groups. It really doesn’t matter—the point is that there are essentially three different IT shops with three different ways of thinking about addressing their users’ needs.

On top of that there are most likely different camps within each campus IT shop: the Web team might consist of a group of people who report to a different manager than the server manager and so forth. In reality, we can probably easily identify at least six or seven entirely distinct IT personnel groups at each campus:

  • Security
  • Web
  • Database
  • Servers
  • Storage
  • Internetworking (routers, ISP connectivity, telecommunications, etc.)
  • Enterprise application administrators (ERP, email, systems management, etc.)

Each of these groups has, as part of its core mission, the protection of the component with which it is charged. Security wants to make sure that all the users and computing equipment are protected from scurrilous attacks. The Internetworking folks want to be certain users can connect to the outside world, and so on.

It is also quite possible that these teams frequently communicate with one another – or not, depending on the robustness of the teams as a whole, and their overarching leadership. For your part as a DRBC architect, this communications element may not be all that clear (at least for now). But it is vastly important. You must discover and conquer it, or you will not be successful with your DRBC plan.

In San Francisco things are even a little more complicated, because the programming team has code repository software such as Visual SourceSafe, in which the latest and greatest game code is kept in one or more database repositories. Not only is there a need for exquisite secrecy as well as rollback capability, there is also the requirement that the code be duplicated off-site for safekeeping. If a game programming company loses its code, it loses its bread and butter. Might as well shut off the lights, because the company will probably never recover from such a devastating loss.

One Tough IT Job – DRBC Architect
Now here’s the crux of the issue: Suppose that you are the project lead charged with putting together a DRBC plan that ropes in all of the corporation’s computing efforts – data, apps, and users. You have three campuses to worry about, along with three entirely separate ways that the IT shops in them view the world. The ERP guys in Denver really don’t think all that much about keeping newly written code safe. They’re more worried about the financials (nearly as important as the code), HR records, and even procurement and provisioning information. After all, the code has to be burnt to CD or DVD, put in a box and shipped to stores, right? And the folks in the San Francisco IT shop, while certainly worried about getting a routine paycheck, probably aren’t all that bothered about the (highly complex) nuances of maintaining ERP records. Nor is Atlanta concerned about either of the other two campuses’ operations (unless, of course, one of the other campuses isn’t getting a job done that directly affects Atlanta).

It boils down to sort of an “every man for himself” attitude—an “us versus them” thing. I’m sure you’ve heard it a thousand times before. You’re talking to someone from Campus B and you hear them say “Here at Campus B, we have this protocol, and this solution, and this software to solve our problems. We’re not like Campus A.” (Or words to that effect.)

What Next?
What’s a person supposed to do with such a disparate array of people, devices, and needs? How is it possible to put together a DRBC plan at all? Well, I have good news and bad news for you.

First of all, the bad news: As a DRBC architect, not only are you obligated to find a way to safeguard the enterprise data the corporation must protect in order to survive (we’ll call this data the corporate value elements), you must accomplish it in such a way that:

  1. The operation’s value elements are held securely off-site somewhere (the DR component); and
  2. The applications used to access the value elements can be quickly brought up and made operational, so as to effect a reasonably rapid and smooth transition from “The San Francisco campus just imploded thanks to a massive earthquake – what do we do now?” to “We’re flying the San Francisco coders to Atlanta, where they’ll be able to continue working” (the BC component).

As we move forward in this series on DRBC, we’ll spend more time on the second item. But it is important that DRBC architects first completely understand the concepts behind pushing value elements from one place to another in order to protect them.

It's Not Flooding — Yet

A travel company in Denver is responsible for tens of thousands of airline booking transactions per second crossing its vast SAN array.

The company has two campuses, with virtually separate and redundant everything. Airline bookings committed at one site are immediately transmitted to the other for safekeeping.

Sounds like a truly well-architected solution, right? There are just two problems. The two sites are no more than 100 yards from one another. Geographic separation is nearly nonexistent — though someone thought to build a sizable berm between the buildings, in case there was an explosion or something.

The other problem is that the entire complex lies in a flood plain. Denver’s a dry place, and we don’t get many floods. But if one were to happen near that complex, everything would be under water (even though both datacenters are on the second floor).

DRBC architects must think about such subtle considerations in order to effect a truly recoverable environment for their organization.

Let’s Talk Data
Think about your corporate data. Is it really good enough that the data lives on a RAID5 array on your server, maybe even on a clustered array in the datacenter? Or even that the data lives on a single SAN box? While these progressively fault-tolerant features make for safer data, they do nothing for you in the event the box isn’t available any longer—regardless of how highly fault-tolerant that box may have been. Think about some of the possibilities:

  • Fire destroys the datacenter
  • Train derailment produces an ammonia spill and contamination leading to a several-week-long HazMat evacuation
  • Flood puts your datacenter six feet under water
  • Earthquake levels the entire city
  • Natural gas explosion creates an ashtray in the center of town
  • Terrorist activity
  • Etc.

A single isolated box sitting in a datacenter, no matter how fault tolerant, will not be enough during a catastrophe such as this.

Ready for the good news? With a modicum of project management skills, some persistence and a little ingenuity, you can architect a solution that meets your DRBC requirements.

Take another look at Figure 1. See anything that lends itself to your current situation? Yes! You have terabytes of space available on not one, but three separate disk arrays.

If there were a way that you could transport the data on, say, the San Francisco disk array to the Atlanta and Denver boxes, you would, at a minimum, have your San Francisco value elements covered, would you not? OK, never mind that we don’t actually have application servers in those other places. And even if we did, we don’t have skilled personnel at those sites able to run the applications in order to communicate with the value elements. But we would have the value elements. The DR piece would be in place.

What we’re talking about here is a concept called islands of storage. You’ve got a disk array at each location. All of them were probably over-engineered for long-haul growth. Today they are most likely underutilized. If we could set up some sort of copy operation that took our value elements from one place and put them in another—ideally one or two others—we could breathe a little easier.

We could triangulate off of one another. In the example, Denver and Atlanta could also push their value elements to the other two campuses (see Figure 2).

Islands of Storage
Figure 2. 2Cool Islands of storage. (Click image to view larger version.)
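To see what this triangulation buys you, here is a minimal sketch in Python. The site names come from the 2Cool example; the replication map itself is a hypothetical illustration, not a description of any real array feature. It checks that every campus could lose its own array and still have two surviving copies of its value elements:

```python
# Hypothetical sketch of 2Cool.USA's triangulated islands of storage:
# each campus pushes its value elements to the other two campuses.
SITES = {"Denver", "San Francisco", "Atlanta"}

# Replication map: source site -> set of destination sites.
replication = {site: SITES - {site} for site in SITES}

def offsite_copies(site: str) -> int:
    """Number of off-site islands holding this site's value elements."""
    return len(replication[site])

# Every campus should survive the loss of its own array with two
# copies of its value elements still intact elsewhere.
for site in SITES:
    assert offsite_copies(site) == 2, f"{site} is under-protected"

print("All value elements have two off-site copies.")
```

The same check scales to more campuses, or to lopsided maps where a small site pushes to only one partner; the point is to make the topology explicit and verifiable rather than implicit in a pile of per-array settings.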

Why “value elements”?
Are you wondering why I insist on calling the data you copy over “value elements”? Here’s the thing: We don’t want to copy everything in Atlanta to Denver and San Francisco, do we? Do we really care if John Smith’s Word document—the one outlining his kids’ chores, the document he was going to post on the refrigerator at home—makes it to Denver?

What we’re after, and all we’re after, is the data that is absolutely required to make the company the company. DRBC architects need to take a serious look at what makes the company go. This is the only data we need.

It's easy to say, much more difficult to formulate. Case in point: Are your Exchange databases mission-critical? If a survey were taken, likely 90 percent of organizations would say they could not survive without e-mail and calendars.

Let’s make it harder still: What about personal storage files? Your CEO keeps everything in a local PST (that you may or may not periodically back up from your enterprise tape system). Does her latest and greatest PST qualify as part of the value elements? What if there’s a not-formally-documented, yet e-mail communicated agreement with another company—one on which she’d intended to work with Legal to hammer out final contractual details, just prior to the disaster? The PST isn’t there, so she’s not able to retrace her steps to remember exactly what it was that was communicated. You may not think it’s important; she might think it’s incredibly so.

You can see how discussions such as this, especially when carried on at an enterprise level, take a long time to hammer out to consensus. Nevertheless, this is job one for DRBC architects: What is it that comprises the corporation’s value elements?
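Once consensus is reached, the outcome is usually a written policy mapping data categories to a copy/don't-copy decision. Here is an illustrative sketch of such a policy table in Python; the category names are hypothetical, drawn from the examples above, and the defaults are mine, not a recommendation:

```python
# Hypothetical written policy, expressed as a lookup table:
# True means the category is a value element and gets copied off-site.
VALUE_ELEMENT_RULES = {
    "source_code":   True,   # the game company's bread and butter
    "financials":    True,
    "hr_records":    True,
    "exchange_db":   True,   # most shops can't live without e-mail
    "executive_pst": True,   # per the CEO scenario above
    "personal_docs": False,  # John Smith's chore list stays home
}

def is_value_element(category: str) -> bool:
    """Return True if policy marks this category for off-site copy."""
    # Anything the policy hasn't classified is excluded by default --
    # forcing the classification debate to happen before the disaster.
    return VALUE_ELEMENT_RULES.get(category, False)

assert is_value_element("source_code")
assert not is_value_element("personal_docs")
```

The deliberately strict default (unclassified data is excluded) is one design choice; some shops prefer the opposite default and trim later. Either way, the decision belongs in a written policy, not in someone's head.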

DRBC Architect Considerations
DRBC architects need to keep in mind some basic tenets in order to make this islands of storage phenomenon happen effectively and efficiently. Because there are people and computer issues, let’s group them into “soft” and “technical” tenets:

Soft Tenets
You have to work with other stakeholders in the organization. This means that they have to be committed to what you’re trying to do, and they have to have people showing up at the meetings. You can’t set up an islands of storage architecture by yourself.

Which means the commitment has to happen at the top levels. If the CEO/CIO doesn’t say “Make go!” you’re not likely to garner much success, regardless of how noble the goal may be.

There must be some initial discovery of exactly what the corporate value elements are, and subsequent written policies built around the establishment of those stated value elements. Expect religious wars:

  • E.g. Is Marketing’s 1999 ad campaign required for ongoing business?
  • E.g. The Atlanta guy gets into a shouting match because he believes his Sun box is so much better than the EMC box. Whether this is true or not doesn’t matter. What you’re after is an islands of storage solution, not (necessarily) a wholesale change-out to like boxes.

Extra budget is going to be required to cement your DRBC plan in place. Company leaders have to be committed to buying things and paying for help to make the plan come into being.

You can’t just go in and rob terabytes of space on someone else’s disk array. They’ve likely set up some sort of growth plan. So, even though you think they’re “hardly using” the space, they’ve probably got it all apportioned. You have to work out quid pro quo agreements that procure more space for them at a later point, in return for you using extra space now.

You Scratch My Back, I Scratch Yours

I’ve seen some pretty cool quid pro quo agreements that have been developed between organizations.

One large city in California wanted to set up an off-site location for their value elements. But they couldn’t afford the monthly costs associated with transporting their data to a so-called “warm” pay-for-play site. So city IT managers worked with a Midwest city and came up with some interesting quid pro quo elements:

  • The California city purchased the rack and storage gear needed, along with the telecommunications connectivity between sites
  • The Midwest city agreed to lend some datacenter space for the California city
  • There was no requirement for the Midwest city to do anything to monitor or administer the California city’s equipment. Most monitoring and administration could be handled remotely by California city IT personnel. Occasionally someone would fly out to check operations.
  • In return, the Midwest city set up exactly the same kind of operation in the California city’s datacenter

By scratching each other’s back in an islands of storage architecture, each city’s value elements were protected, and budgets were managed.

Technical Tenets
Disparate disk arrays require some sort of standardization to allow for interconnectedness. You will most likely plan on iSCSI as the backbone protocol for your endeavors.
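As a taste of what that standardization looks like in practice, here is a minimal iSCSI target definition in the format used by the Linux tgt daemon’s /etc/tgt/targets.conf. The IQN, device path and subnet are placeholder assumptions for illustration, not 2Cool’s real configuration:

```
<target iqn.2004-01.usa.2cool:denver.valueelements>
    # Back the LUN with the volume holding the replicated value elements
    backing-store /dev/vg_storage/value_elements
    # Restrict access to the sister campuses' replication hosts
    initiator-address 10.20.0.0/24
</target>
```

Your arrays may expose iSCSI through their own management software instead, but the moving parts are the same: a named target, backing storage, and an access list limiting who can log in.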

One of your major efforts will be in finding out how the boxes can talk to each other, and how the value elements can be sent from place to place. This may require professional services assistance from more than one vendor.

Most disk arrays don’t come bundled with all the software necessary to make them do everything they’re capable of doing. For example, perhaps the San Francisco office purchased the CLARiiON with absolutely no plan to ever copy data anywhere else. Now you’ve got to procure that copying software, install it, get it working and then test it out. Questions come up: Who pays for the software? Who installs it? Who manages it?

Once the linkages are set up, you have to decide what kind of copy and write operations you want to have:

  • For example, you can set up some disk arrays such that as soon as a write is made to one side, it is made to the other. Disk arrays within a certain distance of one another can undergo synchronous writes, in which the write isn’t acknowledged until both arrays have committed it. Long-haul geographies require what are called asynchronous writes, in which there is a minor time delay before the remote copy catches up.
  • A copy typically involves a point-in-time snapshot of the value elements. The snapshot is subsequently copied over the wire to the other array. You can have multiple snapshot copy events taking place during the day. Some array software supports taking a snapshot delta—just that data that has changed—and subsequently copying to the sister array.
  • DRBC architects often configure different copy and write operations for various data elements. For example, HR data that is fairly static does not need to be copied as often as Sales data (or so we hope).
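The snapshot-and-delta idea in the second bullet can be sketched in a few lines of Python. Real arrays do this in firmware at the block level; this toy simply models a volume as a dict of block IDs to contents, and every name in it is illustrative:

```python
# Toy model of snapshot-delta replication between two disk arrays.
def take_snapshot(volume: dict) -> dict:
    """Point-in-time, read-only copy of the volume's blocks."""
    return dict(volume)

def delta(prev_snap: dict, curr_snap: dict) -> dict:
    """Only the blocks that changed (or appeared) since prev_snap."""
    return {blk: data for blk, data in curr_snap.items()
            if prev_snap.get(blk) != data}

def apply_delta(remote: dict, changes: dict) -> None:
    """Replay the changed blocks onto the sister array's copy."""
    remote.update(changes)

# Day 1: full copy of San Francisco's code-repository volume to Atlanta.
sf_volume = {0: b"game code v1", 1: b"art assets"}
snap1 = take_snapshot(sf_volume)
atlanta_copy = dict(snap1)

# Day 2: one block changes; only that block crosses the wire.
sf_volume[0] = b"game code v2"
snap2 = take_snapshot(sf_volume)
changes = delta(snap1, snap2)
apply_delta(atlanta_copy, changes)

assert atlanta_copy == snap2   # sister array now matches
assert list(changes) == [0]    # and only one block was sent
```

The payoff is the last assertion: after the initial full copy, each replication event ships only the changed blocks, which is what makes multiple copy events per day affordable over a WAN circuit.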

Robin Williams said it best: Redundant, redundant, redundant! The idea is to make everything geographically separate and redundant. Which means more than one path to a campus (see Figures 1 and 2). Which means that the disk arrays must somehow automatically accommodate for a path not being available. DRBC architects think about linkages and paths, making sure that redundancy is built into all operations.
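The path-redundancy idea can be sketched the same way. This hypothetical copy job tries each configured WAN path to the sister campus in turn instead of depending on a single circuit; the path names and the stand-in transfer function are assumptions for the sake of the example:

```python
# Toy model of multipath failover for a replication job.
def send_over(path: str, payload: bytes) -> bool:
    """Stand-in for a real transfer; here we pretend the primary is down."""
    return path != "primary-wan"

def replicate(payload: bytes, paths: list[str]) -> str:
    """Try each configured path until one succeeds; return the path used."""
    for path in paths:
        if send_over(path, payload):
            return path
    raise RuntimeError("all paths to the sister campus are down")

used = replicate(b"value elements", ["primary-wan", "backup-vpn"])
assert used == "backup-vpn"  # the job survived the failed primary circuit
```

Production arrays handle this failover internally, but the architect's job is the same as in the sketch: make sure a second path actually exists, and that the copy operation knows to use it.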

Test before deployment. Using what is called a “table-top” disaster scenario, you need to run the entire thing through its paces. You set up a day where everyone involved (at all locations) is told that such-and-such a disaster has occurred. You then walk through the steps you think you’d need to go through to make sure the organization’s value elements are duly protected and available. Table-tops are fun because you get to be creative and invent a disaster you don’t tell anyone else about. You simply spring it on everyone on table-top day, and see how folks behave. You will learn more at table-top time than you did throughout the entire project leading up to it.

Run annual disaster recovery scenarios. It is important to practice putting the system through its paces, at least once a year. Some shops perform a table-top disaster recovery scenario every six months.

But wait! There’s more!
As we stated earlier, we have the data, but we don’t have anything else. If you’ve successfully copied gigabytes of an Oracle database over to another disk array, it’s not going to do you any good if you don’t have Oracle running on a server at the other site, alongside the applications and people that use the database.

In DR parlance, we have a warm site, but don’t have a hot site. Next time we’ll talk about the ideas involved in creating a hot site. Until then, you’ve got lots of DR architecting work to do!
