Neon Enterprise Software Blog

Welcome to Neon Enterprise Software Blog Sign in | Join | Help
in Search

Data Management Today by Craig Mullins

News, views, and issues involved in managing data as a valuable corporate asset.

Database Archiving: How to Define What to Archive?

OK, so we've come to the conclusion that we need to archive data from our operational databases. Maybe it was driven by regulatory issues that dictated long-term data retention. Maybe it was influenced by the desire to improve the performance of your operational databases and applications. At any rate, you know the business requirements that are driving you to archive. And that is nothing to belittle because getting a handle on all of the regulations and how they impact your business can be tough.

 

The next step is to determine exactly what must be retained. In other words, we are going to archive at the business object level. Another way of saying that is that we need to archive selected records. This is important because we do not want to archive database data at the file level. We need only those specific pieces of data that are no longer needed for operational and reference purposes by the business. This means that the archive needs to be able to selectively choose particular pieces of related data for archival… not the whole database, and not an entire node (table or segment). Instead, all of the data that represents a business object is archived at the same time.

 

What does this mean? Consider, for example, if we choose to archive order data, we would also want to archive the specifics about each item on that order. Think about the order entry and management systems you've encountered in the past. You cannot model such a database such that all the data is in one table. No, you'll have an ORDER table and a one-to-many relationship to the ITEM table: because an order can consist of more than one item. Simply thinking about it, if you go to amazon.com and order three books, there is one order with three items on it, right?

 

Furthermore, your system will likely track products and customers, so we'll have a CUSTOMER table with a one-to-many relationship going from CUSTOMER to ORDER (that is, a customer can place more than one order, but each order is for a single customer only). And we'll have a PRODUCT table with a one-to-many relationship going from PRODUCT to ITEM (that is, a product can be on multiple orders, but each individual order will have only one entry for the product - if the customer order multiples, it will be specified in the quantity field in the item). Just to make it interesting, let's also include a table called LOCATION that stores inventory information for products, such as bin number and quantity on hand. We'd have a a one-to-many relationship going from PRODUCT to LOCATION (that is, a product may be stored in more than one bin, but each bin contains only a single product type). Click here for a graphic image (because, I know, a picture is worth a thousand words).

 

So our data spans multiple constructs within the database (tables for DB2 or Oracle; segments and/or databases for IMS).

 

What do we archive? Let's say we need to retain order details for 7 years after the order was shipped. Do we really need all of that order data clogging up our operational databases for the full 7 years? Probably not. For the purposes of this business, the data is operationally relevant for perhaps a month or two, and it might be referenced for reporting or by customers for a year to 18 months after that. So we decide to keep it in the production databases for two years after the product ships, and archive it after that.

 

But what needs to be archived? If we just archive the ORDER and ITEM tables we'll be missing pertinent information. We also need to bring along customer and product information for reference. That is, we have to make sure that the archive contains more than just the CUSTNO in ORDER and the PRODNO in ITEM. We also need the customer name and contact details. Why? Think about it for a moment... what good would CUSTNO 18123546 be if that customer is no longer in the operational database? And would PRODNO 99 be of any use without the remaining product details?

 

So we need to set up our archive policy to archive the orders from the ORDER table along with the related items in the ITEM table. By archive, I mean we are deleting the data from the operational database when we put it into the archive. For more details, read this. And we need to bring along related products from PRODUCT and customers from CUSTOMER. This data is moved to the archive with the ORDER and ITEM data, but it is not deleted from the operational database. After all, those products may be in other order items that we are not archiving. And those customers may have other orders that we are not archiving. No, we would need to create other archive plans for our customer and product data.

 

Let's think about one of these... let's say we need to retain information about products we have sold for as long as we sell it plus 12 years. That might seem to be onerous to you, but it really isn't. There are all types of products out there with very long retention needs. Consider the data retention requirements of medical devices. Wouldn't you need to maintain information on a pacemaker for the lifespan of any patient who had one embedded? That could be a long time, indeed - longer than 12 years at any rate.

 

So we might decide to create an archive policy for our PRODUCT business object whereby product information is archived when there is no inventory on hand, and no outstanding orders. Or maybe there is particular business process that is followed to end the lifecycle of a product. Assume we have those details. What would need to be archived?

 

Well, obviously the PRODUCT table would need to be archived. What about the tables that are related to it? Well, recall that ORDER and ITEM are both related to product. But we already have an archive policy set up for that data. And we really ought not to archive PRODUCT details if we have outstanding orders, right? What about the LOCATION table? This is the table that contains inventory details of where the product is stocked. Well, you probably don't care about that data any more. (You might, but for the purposes of this discussion, let's say we do not.)

 

So, we need to build an archive plan to archive product information from the PRODUCT table (delete and move), but delete any related LOCATION data. Of course, if there is still stock on hand, that might be signal a problem with our archive policy. Would we really want to archive product data when we still have stock on hand?

 

Well, I hope this blog entry has convinced you that archiving is not a trivial matter. Defining what is to be archived and how it is to be archived requires a mix of IT skills, business acumen, and knowledge of legal and government regulations. But with the volume of data growing as rapidly as it is, and new data retention regulations springing up all over the place, it is not something that can be ignored.

 

I'll be discussing issues and details regarding database archiving in upcoming blog posts, so stay tuned.

Published Friday, February 02, 2007 1:58 PM by cmullins
Filed under:

Attachment(s): order.jpg

Comments

 

Data Management Today by Craig Mullins said:

OK, for the past month or two I've been blogging about database archiving for long-term data retention

February 27, 2007 3:48 PM
 

Data Management Today by Craig Mullins said:

I've written frequently on the need to archive your database data for long-term retention . Topics

January 14, 2008 10:16 AM
 

Data Management Today by Craig Mullins said:

Operational databases are growing in size for many reasons. There is the overarching trend of more and

August 8, 2008 1:58 PM
Anonymous comments are disabled

About cmullins

Craig S. Mullins is a data management strategist for NEON Enterprise Software, Inc.. Craig has extensive experience in the field of database management having worked as an application developer, a DBA, and an instructor with multiple database management systems, including working with with DB2 for z/OS since Version 1. Craig is also an IBM gold consultant and is the author of two books: "DB2 Developer’s Guide" and "Database Administration: Practices and Procedures."
Powered by Community Server, by Telligent Systems