Neon Enterprise Software Blog

Data Management Today by Craig Mullins

News, views, and issues involved in managing data as a valuable corporate asset.

The Importance of Metadata, Part 3

Well, it is a new year, but I am going to continue my series of recent posts on metadata and its importance. In part one, I provided a high-level definition of metadata, and in part two, I discussed the difference between technology metadata and business metadata. Today, I want to talk a bit about data dictionaries and repositories, the common storage and management technologies used to track metadata.

A repository stores information about an organization’s data assets. In other words, repositories are used to store metadata. Repository technology can be quite useful when implemented properly. A correctly implemented repository stores all pertinent metadata for the corporation. It can act as a single, centralized mechanism to assist in the migration of data from multiple sources to a data warehouse.

In choosing a repository, base your decision on the metadata storage and retrieval needs of your entire organization, not just the databases you wish to support. Typically, a repository can:

  • Store information about your data, processes, and environment.
  • Support multiple ways of looking at the same data. An example of this concept is the three-schema approach, in which data is viewed at the conceptual, logical, and physical levels.
  • Store in-depth documentation as well as produce detail and management reports from that documentation.
  • Support data model creation and administration. Integration with popular ETL, data modeling, and CASE tools is also an important evaluation criterion.
  • Support versioning and change control. Versioning helps to synchronize application development, eliminating rework and increasing flexibility.
  • Enforce naming conventions.
  • Parse and extract metadata from multiple sources. For example, if your site is a big COBOL shop, the repository vendor should offer tools that automatically examine your COBOL source code to extract metadata. Of course, support for other high-level languages should also be offered, notably modern languages such as Java.
  • Generate program definition blocks (copy books for you COBOL folks) from data element definitions.
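The last item in the list, generating copy books from data element definitions, can be sketched in a few lines. This is only an illustration under stated assumptions: the `DataElement` class, its `kind` codes ("alpha" and "numeric"), and the `generate_copybook` function are hypothetical names invented here, not the interface of any repository product, and the sketch ignores signed fields, usage clauses, and nested groups.

```python
from dataclasses import dataclass


@dataclass
class DataElement:
    """A hypothetical data element definition as a repository might store it."""
    name: str          # repository name, e.g. "customer_id"
    kind: str          # "alpha" or "numeric" (assumed type codes)
    length: int        # total number of characters or digits
    decimals: int = 0  # implied decimal places (numeric only)


def picture_clause(element):
    """Map a data element to a COBOL PICTURE string."""
    if element.kind == "alpha":
        return f"X({element.length})"
    if element.kind == "numeric":
        pic = f"9({element.length - element.decimals})"
        if element.decimals:
            pic += f"V9({element.decimals})"
        return pic
    raise ValueError(f"unknown kind: {element.kind}")


def cobol_name(name):
    """COBOL data names conventionally use upper case and hyphens."""
    return name.upper().replace("_", "-")


def generate_copybook(record_name, elements):
    """Build a flat record layout: one 01 level with 05-level fields."""
    area_a = " " * 7   # level 01 starts in column 8
    area_b = " " * 11  # level 05 starts in column 12
    lines = [f"{area_a}01  {cobol_name(record_name)}."]
    for element in elements:
        lines.append(
            f"{area_b}05  {cobol_name(element.name):<20} "
            f"PIC {picture_clause(element)}."
        )
    return "\n".join(lines)


elements = [
    DataElement("customer_id", "numeric", 9),
    DataElement("customer_name", "alpha", 30),
    DataElement("balance", "numeric", 9, decimals=2),
]
copybook = generate_copybook("customer_record", elements)
print(copybook)
```

Because the layout is derived from the stored definitions, a change to a data element only has to be made once, in the repository, and the copy book can be regenerated.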

These are some of the more common functions of a repository. When choosing a repository for database development, the following features generally are desirable:

  • The repository can keep its data stores in database tables in the DBMS you are using. This enables your applications to read the data dictionary tables directly. For example, if you are primarily an Oracle shop, you should favor a repository that stores its metadata in Oracle tables. Some repository products can use multiple DBMSs and allow the user to choose the DBMS to be used.
  • The repository should be capable of directly reading the system catalog or views on the system catalog for each DBMS you use. This ensures that the repository will have current information on database objects.
  • If the repository does not directly read the system catalog, an interface should be provided to ease the population of the repository using the system catalog information.
  • The repository provides an interface to any modeling and design tools used for the generation of database objects.
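To make the idea of reading the system catalog concrete, here is a minimal sketch. It uses SQLite purely as a self-contained stand-in, because its catalog (`sqlite_master` and `PRAGMA table_info`) is available from the Python standard library; every DBMS exposes its catalog differently, so the queries would change for Oracle, DB2, or any other product. The function name `harvest_catalog` and the sample `customer` table are assumptions made for the example.

```python
import sqlite3


def harvest_catalog(conn):
    """Return {table_name: [(column_name, declared_type, not_null), ...]}."""
    metadata = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # Escape embedded double quotes so the name is a safe identifier.
        quoted = table_name.replace('"', '""')
        # PRAGMA table_info rows: cid, name, type, notnull, dflt_value, pk
        columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        metadata[table_name] = [(c[1], c[2], bool(c[3])) for c in columns]
    return metadata


# Hypothetical sample database, created in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER NOT NULL, name TEXT)"
)
catalog = harvest_catalog(conn)
print(catalog)
```

Reading the catalog at the moment the metadata is needed, rather than relying on a hand-maintained copy, is what keeps the repository's view of database objects current.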

Some of the popular repository products are mainframe-based and rely on a centralized metadata “database,” or repository. This approach usually is better suited for documenting OLTP-based systems. Such a repository may be more difficult to use in a data warehouse environment because a mainframe focus can present challenges when managing metadata in a distributed, state-of-the-art data warehouse implementation. Many ETL tools used in data warehousing projects also contain a repository that is geared toward the needs of the data warehouse. Organizations needing to manage metadata for both OLTP and data warehouses should make sure that the data in their ETL repositories can be migrated successfully to their OLTP repository in such a way that it remains useful.

Repositories are also commonly found in data modeling tools these days, such as Embarcadero Technologies' ER/Studio and Computer Associates' ERwin. There are also other repository products that are application-centric. Such repository technology focuses on application development metadata, which is useful but not comprehensive. For example, the Microsoft Repository is focused on Visual Studio and on Microsoft computing assets.

Benefits of Using a Repository

Repository technology provides many benefits to organizations that properly exploit its capabilities. The metadata in the repository can be used to integrate views of multiple systems, helping developers to understand how the data is used by those systems. Usage patterns can be analyzed to determine how data is related in ways that may not be formally understood within the organization. Discovery of such patterns can lead to business process innovation.

In general, the primary benefit of a repository is the consistency it provides in documenting data elements and business rules. The repository helps to unify the “islands of independent data” inherent in many legacy systems. The repository enables organizations to recognize the value in their legacy systems by documenting program and operational metadata that can be used to integrate the legacy systems with new application development.

Furthermore, a repository can support a rapidly changing environment, such as the one imposed on organizations by Internet development efforts. The metadata in the repository can be examined to produce impact analysis reports that quickly show how changes in one area will affect others.
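Impact analysis of this kind amounts to walking a dependency graph. The following is a minimal sketch, assuming the repository can supply a "used by" mapping from each object to the objects that depend on it directly; the `impact_of` function and every object name in the sample data are hypothetical.

```python
from collections import deque


def impact_of(changed, used_by):
    """Return every object that depends, directly or indirectly, on `changed`."""
    impacted = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in used_by.get(current, ()):
            # Skip objects already seen so cyclic references terminate.
            if dependent != changed and dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return sorted(impacted)


# Hypothetical metadata: object -> objects that use it directly.
used_by = {
    "CUSTOMER.CUST_ID": ["VIEW_ACTIVE_CUSTOMERS", "PGM_BILLING"],
    "VIEW_ACTIVE_CUSTOMERS": ["RPT_MONTHLY_SALES"],
    "PGM_BILLING": ["JOB_NIGHTLY_INVOICE"],
}

report = impact_of("CUSTOMER.CUST_ID", used_by)
print(report)
```

The report lists indirect dependents (the report and the nightly job) as well as direct ones, which is the information a developer needs before changing the column.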

Reusability is a big time saver. If something can be reused instead of being developed again from scratch, not only will time be saved, but valuable resources can be deployed on more crucial projects. Repositories facilitate reuse by documenting application components and making this metadata available to the organization.

Finally, repositories are an invaluable aid to data warehousing initiatives.

Repository Usage Challenges

One of the biggest challenges in implementing and using repository technology is keeping the repository up-to-date. The repository must be populated using data from multiple sources, all of which can change at any time. When the composition or structure of source data changes, its metadata most likely will need to change, too.

The process for populating the repository is complicated and should be automated as much as possible. Metadata comes from multiple areas and locations within an organization, and its sources can include:

  • Application component metadata from program development tools, application programs and code libraries.
  • Business metadata from business user input, documents and memos.
  • Data modeling metadata from data modeling tools.
  • Database metadata from the DBMS system catalog.
  • ETL metadata from data warehousing tools.
  • Operational metadata from automated operations and job scheduling tools.
  • Other types of metadata such as data usage metadata from query tools.

To be successful, this information needs to be collected, parsed, and recorded in the corporate metadata repository. The integration process must take into account the frequency of change for each metadata source. Whenever metadata changes at the source, the metadata in the repository will be out of sync until the metadata source is scanned, captured, and integrated into the repository again.
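One simple way to notice that a source has drifted out of sync is to compare a fingerprint of its current metadata with the fingerprint recorded at the last scan. The sketch below assumes each source's metadata can be represented as a JSON-serializable structure; the function names and the source names ("dbms_catalog", "etl_tool") are hypothetical, and a real repository would likely use whatever change detection its scanners provide.

```python
import hashlib
import json


def fingerprint(metadata):
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stale_sources(current_metadata, recorded_fingerprints):
    """Return the sources whose metadata differs from the last recorded scan.

    A source with no recorded fingerprint is treated as stale, since it has
    never been integrated into the repository.
    """
    stale = []
    for source, metadata in current_metadata.items():
        if recorded_fingerprints.get(source) != fingerprint(metadata):
            stale.append(source)
    return sorted(stale)


# Fingerprints stored in the repository at the last scan (hypothetical data).
recorded = {
    "dbms_catalog": fingerprint({"customer": ["customer_id", "name"]}),
    "etl_tool": fingerprint({"load_customer": {"target": "customer"}}),
}

# What the sources look like now: a column was added to the customer table.
current = {
    "dbms_catalog": {"customer": ["customer_id", "name", "region"]},
    "etl_tool": {"load_customer": {"target": "customer"}},
}

needs_rescan = stale_sources(current, recorded)
print(needs_rescan)
```

A check like this can run as often as each source's frequency of change warrants, so that only the sources that actually changed are scanned and integrated again.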

Many shops do not own a repository. More accurately, few shops own a centralized metadata repository, although most shops have repositories in multiple software products (e.g., in their data modeling tool or their development workbench). Many organizations that do own a repository do not always implement the proper integration and usage procedures, causing the repository to be neglected. As soon as the metadata in the repository becomes outdated, inaccurate, or non-existent, the repository will cease to be of value. Of course, the fault does not necessarily lie with the repository technology; more likely it is the fault of the organization that does not implement the proper procedures for keeping the metadata in the repository up-to-date. Such an effort requires a significant budget, commitment, and the effort of skilled data management professionals, including DAs and DBAs.

Data Dictionaries

Data dictionaries were the precursors to repository technology and were popular in the 1980s. The purpose of a data dictionary is to manage data definitions. In general, data dictionaries offered little automation; the user had to manually key in the definitions. In some cases the data dictionary was integrated into the DBMS and databases could be defined using the metadata in the data dictionary, but this was pre-relational, before DBMSs had system catalogs.

As more and more types of metadata were identified and organizations desired the ability to accumulate and manage such metadata, the data dictionary was transformed into the repository. Usage of CASE tools such as Excelerator and Bachman for application and database development enabled more metadata to be captured and maintained during the development process. As developers became more sophisticated over time, data dictionaries evolved to provide more than just data attribute descriptions. The products became capable of tracking which applications accessed which databases. As such, developers who used the data dictionary properly were able to more easily maintain their systems and applications.

Truthfully, much of this transformation was caused by IBM’s AD/Cycle and Repository Manager initiatives. Even though both initiatives ultimately failed in the marketplace, repository technology was forever changed by IBM’s ventures into this field. For more information on IBM’s initiatives in this area, consult IBM’s Repository Manager/MVS by Henry C. Lefkovits, the definitive book on the topic (the book is out of print, but available as a used book on Amazon).

And for a very thorough and up-to-date treatment of repository management, consult Building and Managing the Meta Data Repository: A Full Lifecycle Guide by David Marco.

 

Summary

The basic premise of data dictionary and repository software is that metadata has value, and it should be collected, cleansed, managed, and protected. And furthermore, it should be made available to data consumers to add value to the data usage experience.

Published Tuesday, January 02, 2007 1:01 PM by cmullins

Comments

 

Data Management Today by Craig Mullins said:

The first three installments of this series on the importance of metadata covered most of the basics:

January 10, 2007 9:28 AM
 

Data Management Today by Craig Mullins said:

I wrote a series of popular blog posts about metadata and MP3 files for a previous blog (which has since

July 7, 2008 1:58 PM

About cmullins

Craig S. Mullins is a data management strategist for NEON Enterprise Software, Inc. Craig has extensive experience in the field of database management, having worked as an application developer, a DBA, and an instructor with multiple database management systems, including working with DB2 for z/OS since Version 1. Craig is also an IBM gold consultant and is the author of two books: "DB2 Developer’s Guide" and "Database Administration: Practices and Procedures."