Does Archiving to Centera or CAS Still Matter?
Over the past two years, I’ve noticed a rather drastic reduction in the number of archiving conversations I have with customers. Email archiving still comes up, but most of the folks who need to do it are already doing it. File system archiving seems to be even less common these days, though it still surfaces occasionally. There is certainly still a market in healthcare and financials, but even that seems less prevalent than it was at one time. Archiving did come up in a recent conversation, which got me thinking about this topic again, and I thought it’d make a good blog post.
Without a doubt, the archive market seems to have shrunk. I’m reminded of my time at EMC a year and a half ago, when I had to go through some training about “recapturing the archive market”. From the early-to-mid 2000s until the late 2000s, the “archive first” story was the hottest thing going. EMC built an entire business on the Backup, Recovery, and Archive (BURA) story, which centered on archiving your static and stale data first to save money by shrinking the amount of data you need to back up and keep on more expensive Tier 1 storage. Along the way, they brought the term Content Addressable Storage (CAS) into the mainstream, and other vendors followed. The Centera platform was a product EMC purchased rather than developed in-house, but they created a successful business out of it nonetheless. The predecessor of the Centera was a product called FilePool. The founders of FilePool are now actively involved in another CAS startup called Caringo.
How CAS Works
The Content Address is a digital fingerprint of the content. Created mathematically from the content itself, the Content Address represents the object: change the binary representation of the object (e.g. edit the file in any way) and the Content Address changes. This feature guarantees authenticity. Either the original document remains unchanged, or the content has been modified and a new Content Address is created.
Step 1: An object (file, BLOB) is created by a user or application.
Step 2: The application sends the object to the CAS system for storing.
Step 3: The CAS system calculates the object’s Content Address or “fingerprint,” a globally unique identifier.
Step 4: The CAS system then sends the Content Address back to the application.
Step 5: The application stores the Content Address, not the object, for future reference.
When an application wants to recall the object, it sends the Content Address to the CAS system, which retrieves the object. There is no filesystem or logical unit for the application to manage.
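The steps above can be sketched in a few lines of Python. This is only a toy, in-memory illustration of the content-addressing idea, not how Centera or any other product is actually implemented. The class name ToyCAS, its method names, and the choice of SHA-256 as the fingerprint function are all assumptions made for the example.

```python
import hashlib


class ToyCAS:
    """In-memory content-addressed store (illustrative only)."""

    def __init__(self):
        self._objects = {}  # content address -> object bytes

    def store(self, data: bytes) -> str:
        # Steps 2-4: receive the object, compute its address, hand it back.
        # SHA-256 is an assumption; real products choose their own hash.
        address = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(address, data)  # identical content is kept once
        return address

    def retrieve(self, address: str) -> bytes:
        # Recall by address, re-hashing to confirm the content is unchanged.
        data = self._objects[address]
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored content no longer matches its address")
        return data


cas = ToyCAS()
original = b"Q3 financial report"
addr = cas.store(original)                       # Step 5: the app keeps only this
edited_addr = cas.store(original + b" (edited)")

print(addr != edited_addr)             # any edit produces a different address
print(cas.retrieve(addr) == original)  # the original is still retrievable as-is
```

Note that the application never deals with a path, a LUN, or a file system here. It holds an opaque address, and the store can verify on every read that the bytes still match that address.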
CAS systems also had another compelling advantage back in the day: there was very little storage management involved. No RAID groups, LUNs, or Storage Groups to build or allocate. No traditional file system to manage. Per IDC, a full-time employee could effectively manage considerably more CAS storage than any other type (320TB of CAS vs. 25TB of NAS/SAN).
I have to admit, the CAS story was compelling. Thousands of customers signed up and bought hundreds of PBs of CAS from multiple vendors. The Fortune 150 company I worked for in the past implemented hundreds of TBs of Centera CAS as part of an archiving strategy. We archived file system, database, and email data to the system using a variety of ISV packages. Given that this market used to be so hot, I’ve often thought about the possible reasons it cooled off, and why many people now choose a Unified storage platform for archiving rather than a purpose-built CAS system. Here are a few of the thoughts I’ve had so far (comments welcome and appreciated):
- CAS wasn’t as simple as claimed. Despite the claims of zero storage management, in reality I think several of the admin tasks that were eliminated by CAS were replaced by new management activities that were required for CAS. Designing archive processes with your internal business customers, evaluating various archiving software packages, configuring those software packages to work with your CAS system, and troubleshooting those software packages can be cumbersome and time-consuming.
- Storage management has gotten considerably easier in the last 5 years. Most vendors have moved from RAID groups to pools, LUN/Volume creation is handled via GUI instead of CLI, and the GUIs have been streamlined and made easy for the IT generalist to use. Although I would say a CAS appliance can still be easier to manage at scale, the difference is not nearly as great as it was in 2005.
- NetApp created a great story with their one-size-fits-all approach when they built WORM functionality into their Unified storage platform, which was soon copied by EMC in the Celerra product and enhanced to include compliance.
- Many customers didn’t need the guaranteed content authenticity that CAS offers; they simply needed basic archiving. Before NetApp and EMC Unified platforms offered this capability, Centera and other CAS platforms were the only choice for a dedicated archive storage box. Once NetApp and then EMC built archiving into the cost-effective mid-range Unified platform, in my opinion it cut Centera and other CAS systems off at the knees.
- CAS systems were not cheap, even if they could have a better TCO than Tier 1 SAN storage. It was primarily larger enterprises that could afford CAS, while the lower end of the market quickly gravitated to a Unified box that had archive functionality built in.
- Backup windows were not always reduced by archiving. Certainly there were some cases where it could help, but also areas where it did not. As an example, many customers wanted to do file system archiving on file systems with millions and millions of files. When you archive, the data is copied to the archive and a stub is left in the original file system. With traditional backup, these stubs still need to be backed up, and the backup application sees each one as a file. This means that even if the stub is only 1KB, it still causes the backup application to slow way down as part of the catalog indexing process. There are some workarounds, like doing a volume-based backup, which backs up the file system as an image. However, there are caveats here as well. As an example, if you do file-system de-dupe on an EMC platform in conjunction with archiving, you can no longer do granular file-level recoveries from a volume-based backup. Only a full, destructive restore is allowed.
- Many customers didn’t really need to archive for compliance purposes; they simply wanted to save money by moving stale data from Tier 1 storage to Tier 2/3 storage. This required adding cost and complexity for a file migration appliance or ISV software package to perform the file movement between tiers, which ate away at the cost savings. Now that many storage arrays have auto-tiering functionality built in, the system will automatically send less frequently accessed blocks of data to a lower tier of storage, transparently to the admin and end user, with no file stubbing required.
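The backup-window point in the list above is easy to see with a small simulation. The sketch below is a rough illustration only: the function names, the stub format, and the use of a Python dict as the "archive" are all made up for the example and do not reflect how any real archiving package stubs files. It archives every stale file in a directory tree and then counts what a file-by-file backup would still have to walk.

```python
import os
import tempfile
import time

STUB = b"STUB -> archive\n"  # hypothetical stub contents


def archive_stale_files(root, archive, max_age_days):
    """Copy stale files into `archive` (a dict) and leave a small stub behind."""
    cutoff = time.time() - max_age_days * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                with open(path, "rb") as f:
                    archive[path] = f.read()
                with open(path, "wb") as f:
                    f.write(STUB)


def count_and_size(root):
    """Return (number of files, total bytes) a file-level backup would see."""
    count = size = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            count += 1
            size += os.path.getsize(os.path.join(dirpath, name))
    return count, size


with tempfile.TemporaryDirectory() as root:
    old = time.time() - 400 * 86400  # pretend the files are 400 days old
    for i in range(100):
        path = os.path.join(root, f"file{i}.dat")
        with open(path, "wb") as f:
            f.write(b"x" * 10_000)
        os.utime(path, (old, old))

    before = count_and_size(root)
    archive = {}
    archive_stale_files(root, archive, max_age_days=365)
    after = count_and_size(root)

print("before:", before)
print("after: ", after)
```

The total size drops sharply, but the file count does not change at all. A backup application that catalogs file by file still has exactly as many entries to index, which is why stubbing alone often did little for the backup window.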
To sum it up, what would I recommend to a customer today? CAS is still a very important storage product, and although it’s not a rapidly growing area, it still has a significant install base that will remain for some time. There still are some things that a CAS system can do that the Unified boxes cannot. Guaranteed content authenticity with an object-based storage model is certainly one of those, and probably the most important. If you require as good a guarantee as you can possibly get that your archive data is safe, CAS is the way to go. As I alluded to before, this still has importance in the healthcare and financial verticals, though I see smaller institutions in those verticals often choose a Unified platform for cost-effectiveness. Outside of those verticals, if your archive storage needs are under 100TB, I’m of the opinion that a Unified platform is most likely the way to go, keeping in mind every environment can be unique. There may also be exceptions for applications that offer CAS API integration through XAM. If you’re using one of those applications, then it may also make sense to investigate a true CAS platform.
Further reading on CAS:
Good follow-up by SearchStorage making many of the same points I brought up in my article and a few new ones:
http://searchstorage.techtarget.com/magazineContent/Centera-End-of-an-era-or-end-of-an-error
Great article Dan, thanks!