In-Depth

Content-Addressable Storage: A Good Option for your DRBC Plan

You've got SAN and NAS on the mind. But what about unstructured data -- and, more to the point, ways to store it? Enter the CAS.

Until just a couple of years ago, most people recognized only two types of storage. All that has recently changed; your choices are now plentiful. Got your DRBC brain neurons firing? Let’s see what our options are.

Let’s start with the data types first. There are three basic types of data, but they can easily be separated into just two distinct segments: structured and unstructured. Structured data typically involves databases, where there are relationships, rows of records, indexes, triggers, stored procedures and so on. Data residing inside even the most basic databases is rigorously structured. Think about the last Microsoft SQL Server, Access or FoxPro database you created. You had to worry about field definitions (data type, length, mask) along with relationships, indexes and other elements. Rigor is involved with a database, and for good reason; hence the term structured data.

Second, there is unstructured data. This data typically comprises files: predominantly office automation documents in most organizations, but other sorts of lone files as well. BMP, JPG and TIF images of the corporate logo, for example, tend to be strewn across corporate file servers, alongside the occasional installation .EXE, HTML page copy and so forth.

Normally, unstructured data is kept either on the user’s local drive or on some sort of server share. There are two common types of shares: UNIX/Linux shares, served via the Network File System (NFS) protocol, and Windows shares, which use a file-sharing protocol called the Common Internet File System (CIFS). As Windows administrators, you should be highly familiar with CIFS (and the much-maligned "whack-whack" terminology). It’s probably safe to say that many of you have worked with NFS, but for some, the idea of an NFS share may be new and strange. Thanks to advances in Linux technology, NFS shares are almost as easy to set up and share as the familiar Windows shares. Hence, enterprises are engulfed in a plethora of hidden and visible shares (alongside a variety of workgroups that show up in Network Neighborhood as well, but that’s another story).

Structured data, on the other hand, is almost always kept on a server -- typically on a box dedicated to databases and not intended for ordinary file keeping.

What’s the Diff?
When thinking about the difference between structured and unstructured data, what is it that almost immediately comes to mind? Right! Access levels are far different with structured than unstructured data. Users are continuously querying databases, extracting rows of information from them for various reporting or update purposes. But random files -- even if they're in a standard Windows share -- are placed by a single user, most often in a single directory. To be sure, others may access the file over time, but almost always it is a single individual that placed the file there in the first place. You store the file with a name and you retrieve it with a name. It's a singular event, quite unlike multiple users interacting with a database.

Further, server administrators must manage the file and share permissions so that only those authorized to view the file may do so.

When accessing databases, you don't actually worry about a filename. You connect to the database and retrieve the rows and/or columns you've been given permission to view and modify. Who does the permission work? Usually it's the Database Administrator (DBA), not the server administrator, who handles database permissions. (I know, I know...I said there were three data types. Sit tight. We're getting there!)

However, there is another way to think of unstructured data. Suppose you're dealing with medical images, photographs or even music files. Each piece of data is still a file, but if you think about it, this kind of file data goes beyond simply placing a file on a share, doesn't it? Suppose, for example, that we're talking about an x-ray of a tibia -- perhaps one of several x-rays we have for a certain patient, or one of thousands of x-rays of different folks’ tibias. Now we can think about this lone little file as more than a file, can't we? Intuitively, we have a single record that goes along with other file records.

Further, with this kind of data there is probably some data about the data kept with the file -- what we call metadata. In the x-ray example, maybe we keep the patient name, the name of the x-ray technician who took the shot, the radiologist who read it, and the date and hospital location of the x-ray. Metadata is incredibly important to unstructured files like this. But at the end of the day, they're still simply stand-alone files, right? They're not considered structured, database-like files. One can pull up a single file to view it, massage it, delete it, put it away -- whatever -- without the need to look at or touch any of the others.
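The x-ray example might be modeled as a file plus a metadata record that travels alongside it. Here's a minimal sketch; the field names are purely illustrative, not taken from any real PACS or DMS schema:

```python
import json

# Hypothetical sidecar metadata for the tibia x-ray example.
# Field names are illustrative only -- not a real product's schema.
record = {
    "patient_name": "John Smith",
    "technician": "R. Alvarez",       # who took the shot
    "radiologist": "K. Osei",         # who read the x-ray
    "date_taken": "2005-03-14",
    "facility": "General Hospital",
    "filename": "tibia_left.tif",     # the stand-alone file itself
}

# The metadata is "data about the data"; the image remains an
# independent object you can view or delete without touching
# any of its thousands of siblings.
sidecar = json.dumps(record, indent=2)
print(sidecar)
```

The point of the sketch: the file stays a lone object, while the sidecar record gives reviewers a clue about its origin and purpose.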

Even Less Refined: Unstructured Object Data
We call this type of data unstructured object data. Object data is more important than your typical single, stand-alone file. An object data element may have many companion objects similar to itself, and there is most likely some metadata accompanying the file and others of its ilk, so that people reviewing it have a clue about its origin and purpose. Another term for object data is fixed content.

Fixed content files aren't likely to be hit as frequently as databases are, but they are probably more frequently hit than the average Excel sheet someone from Finance pops on the Shared drive on a server.

Clearly there can be quite a difference in the data types. When talking to folks in the data storage industry, it’s important to remember the notion of structured versus unstructured data, as well as the concept behind unstructured fixed content data, as opposed to basic unstructured files. This delineation determines the makeup of your DRBC architecture.

Data Access Differences
Additionally, storage architects know that databases are accessed quite differently than unstructured data. When querying a database, we need to obtain a certain piece of information on a given volume, quickly grabbing the blocks of data required to satisfy the request. This is called block-level access. Generally speaking, Storage Area Network equipment -- especially Fibre Channel, but also iSCSI-equipped gear -- has been tuned for block-level reading and writing: very fast access. Fibre Channel SAN gear tuned for block-level access has been a tremendous boon for large, enterprise-class databases. On top of the superb performance you'll get out of a SAN for your block-level data, you also gain incredibly high fault tolerance -- read that as high availability.

On the other hand, users’ PST files, office automation documents, informal graphics, MP3s -- in other words, ordinary unstructured files -- are best kept on a Network Attached Storage (NAS) device. NAS is designed to be low-maintenance and highly available, with some modicum of performance, though clearly not the performer that a full-on FC SAN is. In most cases (EMC’s notion of NAS is somewhat different from most), NAS consists of ATA-class drives with a very low-overhead OS that allows administrators to build shares and perform routine maintenance such as backups. There are tons of great NAS players on the landscape.

But NAS isn't an ideal place for fixed content. The idea behind NAS is that users may connect to it, perhaps infrequently, to retrieve a file, work on it and then put it back. While certain files may see frequent daily access, it is not the kind of frequency that a database experiences.

Fixed content objects may not experience much more usage than the day-to-day unstructured data on your NAS boxes, but their importance is far greater. If you delete a bitmap of the corporate logo off of a shared drive, doubtless someone has a copy of it sitting on a Mac G5 or PC somewhere. Delete an x-ray of John Smith’s tibia, and that may be the only live edition of the file that you have. Mr. Smith will sue your company, and heads will roll.

Enter CAS
A new type of storage mechanism is required for fixed content data. We need a way of popping a file down onto the storage device and keeping it there for a prescribed period of time. In other words, we want to practice high-quality retention enforcement. In the case of Mr. Smith’s tibia x-ray, there is a Federal mandate forcing us to do so -- the far-reaching Health Insurance Portability and Accountability Act (HIPAA). And those keeping financial records must comply with the Sarbanes-Oxley Act. Nearly every type of endeavor has some sort of need for long-term, safe storage of fixed content data.

Content-addressable storage turns out to be the perfect solution, because it's designed to accommodate fixed content data. The idea is straightforward: Each object going into the system is given a unique address. In the case of the EMC Centera, the address is 27 characters long. Other vendors, of course, have their own addressing schemes. The long and short of it -- the thing to remember -- is that each data object going into the CAS is assigned an address. Some people use the car valet metaphor to describe it: You give your car to the valet and he or she gives you a number. When you're ready to collect the car, the valet checks the number and thus knows how to retrieve your car. The retrieval isn't predicated on make, model and type; it's based upon that unique number. So it is with fixed content object addresses. The CAS knows where to get an object based upon its address, not a filename or path. Thus, you could conceivably have a health record sitting right next to a personnel document. Document pools are used to manage data with unique retention requirements.
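The valet-ticket idea can be sketched in a few lines. Real products use their own hashing and encoding schemes; this toy store (the `TinyCAS` class is invented for illustration) derives its "ticket" from a SHA-256 digest of the object's bytes:

```python
import hashlib

class TinyCAS:
    """Toy content-addressable store: objects are keyed by a hash
    of their bytes, never by a filename or path."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The "valet ticket": an address derived from the content itself.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        # Retrieval is predicated on the ticket alone.
        return self._objects[address]

cas = TinyCAS()
ticket = cas.put(b"x-ray: John Smith, left tibia")
assert cas.get(ticket) == b"x-ray: John Smith, left tibia"

# Identical content always yields the same address.
assert cas.put(b"x-ray: John Smith, left tibia") == ticket
```

Note a side effect of content-derived addressing: storing the same bytes twice hands back the same ticket, which is why a health record and a personnel document can sit side by side without ever colliding.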

Security Is A Huge Advantage
When you procure a CAS, you're obtaining a device that expects to ingest fixed content-class data, assigning a unique address to each object. That means you need some sort of user interface to retrieve the data. The unique addressing and UI requirement means that security can be tightly controlled, because you: a) can't get at the data using normal TCP/IP commands, and b) must possess the UI to access the data (not just any Tom, Dick or Harry can use ordinary Windows tools to hack in). In EMC’s case, the Centera was designed to "play nicely in the sandbox" with a variety of document management platforms such as Documentum. A document management system (DMS) blends the notion of standardized documents with centralized administration policies for managing and storing them. The combination of CAS and a DMS is the one-two punch that fixed content data needs.

CAS brings heightened document retention capabilities such as the ability to "e-shred" documents that have lived out their retention time and now must permanently go away. You have the ability to assign a variety of retention classes to various categories of files. One of the more interesting capabilities is to put a "litigation hold" on a group of files that have been subpoenaed for trial.
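Retention enforcement of this sort can be sketched as follows -- a minimal, vendor-neutral illustration (the class and method names are invented, not any real CAS API):

```python
from datetime import date

class RetainedObject:
    """Sketch of retention enforcement: deletes are refused until the
    retention period lapses, and a litigation hold blocks them outright.
    Names are illustrative, not a vendor API."""

    def __init__(self, stored_on: date, retention_years: int):
        self.expires = stored_on.replace(year=stored_on.year + retention_years)
        self.litigation_hold = False
        self.shredded = False

    def e_shred(self, today: date) -> bool:
        if self.litigation_hold:
            return False          # subpoenaed for trial: delete thwarted
        if today < self.expires:
            return False          # still inside its retention period
        self.shredded = True      # retention lapsed: permanently destroy
        return True

doc = RetainedObject(stored_on=date(2000, 1, 1), retention_years=7)
assert not doc.e_shred(date(2005, 1, 1))   # too early -- refused

doc.litigation_hold = True
assert not doc.e_shred(date(2010, 1, 1))   # held for trial -- refused

doc.litigation_hold = False
assert doc.e_shred(date(2010, 1, 1))       # retention over -- e-shredded
```

The design point: the storage device itself, not the goodwill of whoever holds the delete key, is what enforces the policy.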

E-mail continues to be of major importance for organizations. CAS manufacturers have come up with ways to streamline the archiving of important e-mails for easy, fast retrieval.

The CAS/DMS combination, when thoughtfully architected and deployed, brings superb DRBC and security value to organizations.

Document Manager/Auditor: The Idea Person
In many organizations, there is a person put in charge of managing important document retention, destruction and sharing policies. This person is the auditor and manager of corporate document policies and is generally the idea person behind the way the CAS is architected and set up. Perhaps Sar-Box requires a 9-year retention period for certain financial transactions and an unlimited retention period for others. When a document object first hits the CAS, it's put into a pool to which a given retention policy applies -- all based upon the decision-making of the Document Manager and associates.

The Document Manager is also the person who makes decisions about the storage and retrieval of said items: Who, when, why and how. Suppose, for example, someone tries to delete a file that must comply with privacy laws and is under a retention policy. An audit detects the attempt, providing the document manager with very granular control over the content. The delete attempt is thwarted.

Can't We All Just Get Along?

At one time I worked for a large Colorado municipality. We had a person that acted as the Document Manager for the city. He wasn't technical at all and needed the input of the IT department technicians to guide him in architecting and establishing enterprise-class document systems.

His key interests lay in making sure that documents were uniform in their structure -- i.e., you could immediately tell that a document originated from the city, as well as which department it came from -- and that they all had a safe place to live during their retention period. He established retention periods based upon nationally recognized guidelines, and went around glad-handing a lot of different department heads in an effort to make sure everyone was on the same page in terms of document policies.

The problem came when IT found out that various departments had already purchased, installed and put to use not one, but several different document management systems. One group had a Hummingbird deployment, another used FileNet, still another used a weird, specialized document management system from a largely unknown company. In all, we had seven different document management implementations, each of which used a different vendor’s approach to document retention and policy.

While there are CAS systems that can interface with a variety of different inputs, my experience tells me that such integrated systems do not work well. You work hard to get mainframe data correctly copying itself to a remote repository, only to find that a Linux-based copy operation isn't operating as expected. Integrated systems require way too much time and manpower. If possible, it is better to scrap the old and bring in the new.

Which is exactly what IT recommended: Procure a CAS system and a single uniform document management tool that could be utilized by all. Such a system is not only possible, but there are hundreds of examples in place all over the world. An enterprise-class CAS array tied to a formal DMS with the capability of hosting a variety of disparate groups affords you the power to meet an enormous variety of needs.

This was not to be, however. As it turns out, people can often be the most difficult problem when designing and implementing new systems, especially DRBC systems. Such was the case here: Different departmental stakeholders could not agree and would not give up their siloed systems in favor of an enterprise-class approach. They would have to be dragged kicking and screaming from their "solution," never mind that in a couple of cases these very same people complained about how terribly it functioned. The document manager was left to manage, as best he could, an array of isolated implementations with a variety of file-system and archival approaches.

The DRBC Line-Up
So, from a DRBC perspective, how does it all come together? Money being no object, you'd ideally want to procure SAN for the enterprise-class databases, NAS for the day-to-day file shares, and CAS for fixed content data that cannot go away (whether mysteriously -- i.e., we're not sure, but we think Margie accidentally hit the delete key -- or on purpose).

You’d also want to architect a long-distance replication solution that could handle synchronous, or near-synchronous, operations -- especially on the CAS. In other words, within a few minutes of a fixed content object hitting your CAS in San Francisco, you'd be able to assure yourself that it was copied to your Chicago offsite operations. You may not care very much about spreadsheets sitting on the shared drive, but you and the document manager had better care about private e-mails, financial documents, patient history and other regulated information. Even if corporate HQ augers in, that balance sheet had better be available when called for.

The CAS would require some sort of a strong document management software implementation, which means yet more architecture time, a well-crafted project plan, and additional months to deploy.

The SAN architecture, also subject to synchronous writes, would require intense DBA and system administrator interaction. I highly recommend that a clustered Exchange implementation be integrated as a part of the SAN system.

PSTs and other unstructured data could live on an ATA-class NAS device, subject to periodic offsite shadow copy operations. We care about PSTs, but not as deeply as we care about customer purchases, patient history, stock trading, licenses or other business documents.

In the end, even if everything on your side of town was simultaneously on fire, underwater and sucked into a gaping sinkhole, you could proudly phone your CEO and say "Boss, got ‘er covered. It’s all good. No worries here."

Provided, of course, he’s still available to take the call.
