by Bob Lambert
Picture this: You’re the tech lead on a large ETL project, integrating data from 17 suppliers and five internal systems for a logistics management application. The system will enable planners to optimize the supply chain with integrated order, shipment and production data. The team is finalizing design, having already defined business requirements, identified business data elements, completed database design and assembled Excel source-to-target mappings for all 107 interfaces. Your project manager stops by to tell you that a new supply director wants to change five of your key requirements, and that the database administrators redesigned one section of the model to meet new standards. She needs an impact assessment before the steering committee meeting at 10:00 a.m. tomorrow morning.
Once development starts, data integration projects are served well by metadata tools. Major ETL vendors like Informatica, IBM WebSphere DataStage and others include metadata solutions that document mappings and transformations, enabling impact analysis in the event of interface changes, database design changes and data quality problems. However, metadata associated with development tools only kicks in when development starts. A significant part of data integration effort happens before the virtual pen meets paper to build ETL maps. Development can only begin after the team defines requirements, designs the database and maps source data elements to their targets.
On a small integration project with just a few interfaces, this isn’t a problem. However, on a more complex project featuring scores of interfaces, it can be a major manual task to ensure the design meets all requirements, that the interfaces as designed will load the database properly, and to correctly identify the redesign needed in response to changes in the interface, data model or requirements.
This article describes a working SQL Server prototype (found here) that shows how a data integration team can build its own custom metadata database based on commonly used data integration design artifacts and enable the impact analysis and change management benefits that metadata provides well before ETL coding starts. Furthermore, metadata analysis can help improve the rigor and quality of requirements, mapping and database design artifacts, preventing defects early and, therefore, helping to improve quality and reduce cost.
The Data Integration Project Pattern
Data integration projects typically implement systems that take in interface data, store it in a database and deliver that data in the form of outbound interfaces (see figure). Interfaces may be in various formats but often can be thought of as files containing fields. Likewise, the database consists of tables and columns. Requirements, in our example, are affirmative statements of business objectives, which subdivide into “data requirements,” the logical data elements needed to fulfill each requirement.
The industry standard interface definition artifact is the “source to target document,” typically a spreadsheet relating (for incoming interfaces) each incoming field with its destination database column and describing any transformations that must occur in the process of delivering the incoming field.
On the typical integration project, activities like making sure different interfaces are mapped consistently, or making sure the design loads all required database columns, or determining the impact of a design change are a matter of manually comparing source-to-target mapping spreadsheets – not a problem on a small project with only a few interfaces.
However, Frederick Brooks’ Mythical Man-Month concept comes into play quickly as project size increases. Brooks’ concept is that project complexity increases exponentially as the size of the team increases. According to Brooks, project complexity is related more to the number of communication paths than the number of people. There are n(n-1)/2 communication paths among n people. So, a four-person project has six communication paths, an eight-person project has 28, and a 25-person project has 300. By this logic, the 25-person project is more than 10 times more complex than the eight-person project.
It is easy to see how this concept operates on a large integration project. Say there are 75 source-to-target mapping documents, each mapping about 100 interface fields, halfway into the design phase. If a major requirements shift results in a change to database design, then the team embarks on a major review and revision of all 75 mapping spreadsheets, spending tense hours in review meetings to ensure the changes are applied consistently and the project remains on schedule.
The Conceptual Integration Metadata Model
It is also easy to see how a database linking the field-to-column mappings, database tables and columns, and data requirements could quickly identify the mappings impacted by the database change and reduce the level of effort in identifying defects in the revised mappings. Fortunately, data integration teams tend to be rich in database skills, so development of a metadata database should be within the practical scope of the team’s abilities.
Again, the goal is to integrate requirements, interfaces and database tables, along with their detailed elements: data requirements, interface fields and database columns, as in the conceptual metadata model in the figure:
The model shows that each requirement implies many data requirements, each table consists of columns and each interface consists of fields.
At a detailed level, the relationships show that:
• Each data requirement may be fulfilled by many database columns,
• Each column can potentially fulfill many data requirements,
• Each column may be the destination for load of many fields (for example, if many interface fields combine in an equation to calculate a single column), and
• Each field may load many separate columns.
Data Integration Analysis and Design Artifacts
The core analysis documents of the data integration project are the source-to-target document, the data model and the requirements document. These artifacts will provide the data to be loaded into the database based on our logical metamodel.
Source to Target Document
Before ETL developers can build interface processes, analysts must document which interface field loads which column or vice versa. Source-to-target maps are typically Excel spreadsheets. Each row on the sheet represents the relationship between one data input element and one destination element, and typically includes the following columns:
- Interface Name: The name of the external source or target data structure;
- Direction: Whether the interface is incoming or outgoing from the point of view of the central database;
- Target Table or File Name;
- Target Data Elements: descriptions of each target column or field, including Column or Field Name, Data Type, Length, Description, Source Table or File Name;
- Source Data Elements: descriptions of each target column or field, including Column or Field Name, Data Type, Length, Description;
- Transformations: the calculations, validations and/or conversions required to source data elements to target data elements; and
For our purposes, the data model is the list of tables and columns in the database, and the data type and length of each column. The data model must be complete before source-to-target mapping begins, optimally already deployed to the database management system. If so, and if the DBMS is SQL Server, then the system views in INFORMATION_SCHEMA provide complete application-level metadata, including table names, the columns in each table, and data type and length for each column.
Requirements and Data Requirements
Requirements documentation tends to be less structured than source-to-target mappings and databases. But still, data integration teams commonly maintain requirements in spreadsheets, and on most projects it is practical for data requirements to be reasonably rigorous and therefore comparable to database columns and interface fields. The next figure lists examples of requirements and associated data requirements from our fictional example project.
Bringing it All Together: the Data Integration Metadata Database
[Click on PDF link at the end of the article to view the Data Integration Meta-Database Model.]
This figure is a SQL Server model showing a database designed to house the contents of the project’s source-to-target maps, the tables and columns in the database and data requirements. The database consists of the following tables:
- Requirement: The requirement table holds one row for each data-related business requirement, and includes the text description of the requirement along with reference and priority attributes.
- DataReqmt: A Data requirement is a logical data element needed to fulfill a requirement. There might be many data requirements associated with a requirement.
- DBTable: Each DBTable row represents a table in the integrated database.
- DBColumn: Each row of this table describes a column in the integrated database, with descriptive attributes drawn from the database catalog (in our case SQL Server’s INFORMATION_SCHEMA.COLUMNS view).
- Interface: This table contains rows describing external interfaces with descriptive attributes, including an indicator of whether the interface is incoming or outgoing.
- Field: Each row of this table describes a data element delivered by the interface (or to the interface if it is outbound). Descriptive attributes include data type, length, whether the field is required, text describing the field’s allowable values, and a FieldGroupNameText identifying the subgroup that the field occupies in the Interface’s format definition. The metamodel could be extended if it needs to account for more complex interface data structures.
- ReqmtMapping: This is the associative table representing the many-to-many relationship between data requirements and integration database columns. The table includes columns identifying the matched DataReqmt and DBColumn, as well as a text “Condition” column detailing the nature of the relationship if the column partially or conditionally fulfills the data requirement.
- InterfaceMapping: Each row of this associative table corresponds with a row on the source-to-target spreadsheet, showing the connection between a DBColumn and an incoming or outgoing interface field. In addition to attributes identifying each related column and field, text attributes house required transformations and additional comments.
Each table in the metamodel includes typical audit columns enabling administrators to track inserts and updates, with the associated user ID.
Loading and Using the Integration Meta-Database
Data integration teams are rich in the skills required to deploy, load and use the integration metadata database. The contents of DBTable and DBColumn can be inserted via simple INSERT queries from INTEGRATION_SCHEMA.COLUMNS, as shown (again, assuming SQL Server – all leading DBMS offerings have similar catalog tables). It is also a simple matter to load the source-to-target documents and requirements spreadsheets to their respective tables using SQL Server’s Import and Export Data Wizard, which steps the user though development of a repeatable SQL Server Integration Services job for loading the requirements and source to target mapping spreadsheets.
Once the integration metadatabase is populated the team can begin to use basic SQL queries to answer questions that formerly would have required tedious, error-prone spreadsheet-to-spreadsheet comparisons:
- Will the defined interfaces load all expected database tables?
- Does the database as designed meet all data requirements?
- Which interfaces will I affect if I redesign a table?
- If the quality of a given source data element is found to be poor, what requirements are at risk?
Using Integration Design Metadata on Your Projects
Bringing data integration requirements, data model and interfaces together into a metadata database can substantially benefit your integration project, but as always there are a few caveats and potential surprises to watch out for.
Integration Project Benefits
We’ve already described the core benefits to quality control and change management. The team can run SQL queries to quickly ensure coverage of requirements, ensure full population of the core database and outbound interfaces, and assess the impact of changes to requirements, database design and interfaces. In addition, side effects help improve the project’s likelihood of on-time delivery of a quality result:
- Improved process consistency and rigor: The requirement that analysis and design artifacts be loaded to a database means they must be consistent and correct, at least in terms of format. On large projects, different subteams develop different requirements definitions and ETL mapping conventions. Loading these artifacts to a database imposes a rigor on the process that otherwise requires much manual review to enforce.
- Improved quality of analysis and design deliverables: Using the metadatabase to enforce foreign key relationships, making it impossible to add, for example, a data requirement for which there is no requirement, or a field that has no associated interface, ensures that the working spreadsheets have no loose ends caused by unconnected columns, fields and data requirements.
- Improved communications: Often, business and technical team members divide into separate camps that don’t communicate as well as they should. The relationships needed in the metadata database require business and technical participants to communicate, even if only to correctly map requirements to columns. That communication alone substantially improves specificity of requirements and business alignment of the database.
Applying the Metadata Prototype
As always, these benefits don’t come for free. Teams implementing a metadata solution like this one should be aware of these considerations:
Metadata management requires resources: While metadata development, maintenance and administration are well within the data integration team’s skill set, they are nontrivial tasks. Level of effort depends on project size and specifics, but estimate at least the same level of effort allocated to maintenance of tracking documents on a typical agile project.
The prototype is functional but not production-ready: Generally, teams should revise the prototype code and spreadsheet formats to conform to site conventions and standards. Teams should especially:
- Apply appropriate metadata change management: The prototype includes staging targets for the SQL Server Integration Services jobs that load the requirements and source-to-target mapping spreadsheets. Insert statements migrate data from staging to the metadata database assuming they are populating the tables from scratch. Your project may need to delete then insert relevant rows, or update or insert depending on whether or not the given spreadsheet row already exists.
- Apply effective/end dating if needed: The prototype metadata tables include typical audit columns, but are not effective dated. If your team requires, for example, the ability to reverse a project change and go back to an earlier state, or to track the lineage of a requirement and its associated columns and fields, you might want to add effective and end dates to each table.
Additional rigor requires additional management: Any project methodology change requires planning and management on the part of team leadership, and requires them to make a choice between fighting off alligators or draining the swamp. Leaders of projects using a metadata solution must demand the additional rigor required to support the needed processes and standards, and reject work that doesn’t meet the requirements and ETL mapping standards that the metadata solution requires. The reward is the ability to run a few queries and get the answers to the PM in time for tomorrow’s steering committee meeting.
Bob Lambert is a consultant with CapTech Ventures specializing in data management and business intelligence strategy, architecture and management services. You can reach him at firstname.lastname@example.org or email@example.com.
He also blogs on business-aligned IT at http://robertlambert.net.