Data Driven Design with Agile for Data Warehousing
This article is an introduction to using Data Driven Design and the Agile Project Methodology to quickly deliver a Data Warehouse.
What is Data Driven Design for Data Warehousing?
Data Driven Design refers to a design technique that focuses on examining the data and database schema of Source Systems to understand the Business Processes and data of the Business. From the knowledge gathered, it becomes possible to design a Data Warehouse that captures these Business Processes. It’s best used in combination with an Agile development methodology, where multiple iterations of the project improve and refine the Data Warehouse solution. Data Driven Design has become a feasible option because of a new generation of Data Warehousing tools that generate a Data Warehouse directly from a design, rather than relying on complex manual coding. This allows for a highly flexible solution suited to iterative development.
Why use the Agile Methodology for Data Warehousing Projects?
Many IT projects fail. Depending on who you believe, it’s anywhere between 50% and 80% (see references below). Shocking! I believe you need to take these figures with a grain of salt, however; it depends on your definition of failure (see the discussion on failure below).
It has also been shown that small projects are more successful than large projects, and that projects using the Agile methodology are more successful (see references). Given that small Agile projects are the most successful, and that Gartner states that over 70% of BI projects are not successful, it follows that applying an Agile project methodology to a Data Warehouse project is a sensible strategy. Data Driven Design is a perfect companion for Agile.
Why Data Driven Design for Data Warehousing Projects?
Data Warehouse projects have certain characteristics that make them suitable for Data Driven Design. The key characteristic is that Data Warehouse projects are highly constrained. They are constrained by the data contained in the source systems of the Data Warehouse, and, from a requirements perspective, a Data Warehouse is constrained to modelling existing business processes (other than perhaps reporting and management processes).
Unlike an operational system project, where development is about new or changing business processes, it is not realistic to gather requirements about, for instance, which attributes of a Customer need to be captured, and then model them into a Data Warehouse Data Model. First, the underlying Source System may not support those attributes; second, it is easier to go directly to the existing Source System and discover them.
Data Driven Design shortcuts the requirements process. Clearly, existing Business Processes will be manifest in one or more Source Systems, and can be ‘discovered’. Thus a Data Driven Design approach can be taken, using existing data to derive a design for the Data Warehouse. Data Driven Design doesn’t mean ignoring business requirements altogether. To the contrary, the business needs to be brought along on the journey, but only limited and specific input and prioritization is required from the business in the early iterations. In later iterations the business will be better equipped to properly articulate their requirements.
Overview of Data Driven Design for Data Warehousing
Below is an outline of Data Driven Design and Agile Methodology applied to Data Warehousing.
First Iteration
The first iteration is a little different. It is as much a business education process, as it is a development exercise. The four steps of the process are described below. They will occur in roughly this order, but it’s likely that some steps will occur simultaneously and repetitively as required.
- Discover the Business Processes that are included in the project scope. Data Driven Design doesn’t mean ignoring business requirements altogether; to the contrary, the business needs to be consulted constantly and brought along on the journey. A detailed description of each Business Process is not required: just a simple, accurate story card for each Business Process and Business Event, identifying the key entities involved (the Who, What, Where, When and Why) and how the Business Events are measured. I could start to write about Ontologies, Semantics and Taxonomies here… but that’s another post! The identified Business Processes need to be prioritized in business value order. This order needs to be weighed against the technical complexity of each Business Process to decide which Business Processes will be implemented in the first iteration of the project. Try to choose a relatively high-value, simple Business Process for the first iteration.
- Discover the Source Systems for the selected iteration 1 Business Processes and get access to their databases/files etc. Having real-life data is important, and can sometimes be a challenge.
- Discover, within the Source System database(s), the manifestation of each Business Process. This will require communication back and forth with the business representatives. Think of yourself as a data detective. A forensic data detective, CSI style! Quite often Source Database tables and columns are well named and easy to understand. In other cases, the Database schema is undecipherable; many packaged ERP systems suffer from this. Every Holmes needs a Watson. Collaboration with a data-focused Subject Matter Expert (SME) or experienced Source System programmer is essential. A data-focused SME will have experience with the underlying data, perhaps learned from extracting data from the Source System Database for reports. This step will identify the tables, columns, data and grain of data involved in each Business Process. It should also discover various cases and examples of the Business Process within the data. Save these cases if possible; they become useful for testing later. Physically, this step will involve a lot of data profiling and querying of the Database. Other resources you will find useful are Source System user manuals, technical manuals, a data dictionary, and existing reports that extract data.
- Develop a first-cut Data Warehouse Data Model and ETL based on what has been discovered. The Data Model will contain only those tables required for the first iteration, but must conform to good Data Warehouse design principles, so that the model can be easily expanded in the future. It is absolutely essential that a next-generation Data Warehousing tool like Dimodelo Data Warehouse Studio is used to develop the Data Warehouse and ETL. The Data Warehouse and ETL are going to go through many iterations, much change, plenty of regression testing and several releases. Short iterations are just not feasible without such a tool.
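To give a flavour of the data profiling mentioned in the discovery step, the sketch below counts rows, distinct values and nulls per column. It is a minimal illustration only, using Python and an in-memory SQLite database; the `cust` table and its columns are hypothetical stand-ins for a real source-system extract, not from any particular system.

```python
import sqlite3

def profile_table(conn, table):
    """Profile each column of a source table: total rows, distinct
    non-null values, and null count. A starting point for discovering
    the grain and quality of source-system data."""
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    results = {}
    for col in cols:
        distinct = conn.execute(
            f"SELECT COUNT(DISTINCT {col}) FROM {table}").fetchone()[0]
        nulls = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL").fetchone()[0]
        results[col] = {"rows": total, "distinct": distinct, "nulls": nulls}
    return results

# Hypothetical customer extract standing in for a real Source System table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (cust_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO cust VALUES (?, ?)",
                 [(1, "North"), (2, "South"), (3, None)])
profile = profile_table(conn, "cust")
```

A null-heavy or single-valued column spotted this way is exactly the kind of finding to take back to the data-focused SME.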
The idea of the first iteration, and to a lesser extent other early iterations, is to give the business a first look at the Data Warehouse. This helps get them thinking dimensionally, and improves their ability to articulate requirements. Gathering requirements is traditionally a major issue in Data Warehouse projects. The first iteration is an education process, helping the business understand the capabilities of BI. We recommend you demonstrate standard reports, dashboards, scorecards and ad-hoc analytics. Only deploy the first iteration to a sandpit environment: somewhere the users can ‘play’. Keep expectations of data accuracy low; it’s unlikely you will nail the data 100%. It’s possible you may even need to start from scratch again, but that’s not an issue with Dimodelo Data Warehouse Studio or a similar tool.
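To make the first-cut Data Model and ETL step concrete, here is a hand-written sketch (Python with SQLite) of a star schema and load for a single hypothetical Sales Business Process. Every table and column name is illustrative only; in practice a tool like Dimodelo Data Warehouse Studio would generate the Data Warehouse and ETL from a design rather than relying on code like this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# First-cut dimensional model: one dimension, one fact, surrogate keys.
conn.executescript("""
CREATE TABLE DimCustomer (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key
    customer_id  TEXT,                 -- natural key from the Source System
    region       TEXT                  -- a discovered descriptive attribute
);
CREATE TABLE FactSales (
    customer_key INTEGER REFERENCES DimCustomer (customer_key),
    sale_date    TEXT,                 -- the 'When' of the Business Event
    amount       REAL                  -- how the Business Event is measured
);
""")

# Minimal ETL: load the dimension first, then resolve the surrogate key
# when loading the fact row.
conn.execute("INSERT INTO DimCustomer (customer_id, region) "
             "VALUES ('C001', 'North')")
key = conn.execute("SELECT customer_key FROM DimCustomer "
                   "WHERE customer_id = 'C001'").fetchone()[0]
conn.execute("INSERT INTO FactSales VALUES (?, '2014-01-15', 250.0)", (key,))

total_sales = conn.execute("SELECT SUM(amount) FROM FactSales").fetchone()[0]
```

Even at this toy scale the design principle holds: facts carry surrogate keys and measures, dimensions carry the Who/What/Where context, and the model can grow table by table in later iterations.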
Subsequent Iterations
Rinse and repeat. Subsequent iterations follow the same process as the first, but their focus is to improve and refine the existing solution based on feedback from previous iterations. Only introduce new functionality for new Business Processes when resources allow.
Subsequent iterations introduce additional Agile disciplines, like Test Driven Development, Continuous Integration, Source Control and Release Management, each of which could fill several blog posts on its own. Suffice to say, automation is the key.
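As a taste of what that automation looks like, the sketch below shows two typical automated regression checks for a Data Warehouse build: an orphaned-fact check and a source-to-target reconciliation. The table names are hypothetical; in practice checks like these would run in a test harness on every build.

```python
import sqlite3

def orphaned_fact_rows(conn):
    """Count fact rows whose customer_key has no matching dimension row.
    Should always be zero after a successful ETL run."""
    return conn.execute("""
        SELECT COUNT(*) FROM FactSales f
        LEFT JOIN DimCustomer d ON f.customer_key = d.customer_key
        WHERE d.customer_key IS NULL""").fetchone()[0]

def reconciles(conn):
    """Check that the fact table total matches the staging (source) total,
    a common source-to-target reconciliation test."""
    staged = conn.execute("SELECT SUM(amount) FROM StageSales").fetchone()[0]
    loaded = conn.execute("SELECT SUM(amount) FROM FactSales").fetchone()[0]
    return staged == loaded

# Hypothetical tables to exercise the checks.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (customer_key INTEGER PRIMARY KEY, customer_id TEXT);
CREATE TABLE FactSales (customer_key INTEGER, amount REAL);
CREATE TABLE StageSales (customer_id TEXT, amount REAL);
INSERT INTO DimCustomer VALUES (1, 'C001');
INSERT INTO StageSales VALUES ('C001', 100.0);
INSERT INTO FactSales VALUES (1, 100.0);
""")
```

Checks like these, run automatically, are what make short iterations and frequent regression testing feasible.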
It’s likely that subsequent iterations become release candidates for release into production. Again, discipline is required. Releasing a Data Warehouse into production is not an easy task, especially if there is an existing version already in production. At this point it is also important to focus on non-functional requirements like security, disaster recovery etc. Usually the planning for non-functional requirements will have already occurred, and there is nothing wrong with that. However, in a fairly standard architectural scenario, there is an argument that it may not be wise to invest time and money in architectural tasks until a viable Data Warehouse and BI solution that meets the business needs has been proved.
.. Now… if only we could get the PMO, the Op Ex and Cap Ex budgets, and the business to align with the uncertainty of an Agile project… A whole other blog post again.
The Criticism of Data Driven Design
In the past, Data Driven Design was used on Data Warehouse projects to the exclusion of all other techniques, without an Agile project methodology or Agile Data Warehouse development tools like Dimodelo Data Warehouse Studio. User involvement was avoided or minimized (which suited IT). It was mistakenly thought that Data Warehouse development meant simply re-modelling multiple source systems into a single Data Warehouse model. Only after the system was built, using inefficient, error-prone, time-consuming and inflexible manual coding practices, were the users engaged properly. Unfortunately, more often than not, the results missed the mark and failed to answer the business questions. With an inflexible ETL framework that took months to make any significant change, the project team found themselves in trouble, and potentially out of budget.
The real problem was the time the development team took to develop the solution and the inflexible nature of the resulting system. Shorten the development cycles, use an Agile Development tool, and call the result a prototype, and a paradigm shift occurs. Iterate multiple times to improve and refine the application to the business requirements.
IT Project Failure rate references:
- Failure Rate References
- Gartner 70-80% of BI Projects Fail and Report
Small projects are more successful than large projects references:
- Gartner Survey Shows Why Projects Fail
- Failure of large projects – why it is better to be small, particularly in IT
Agile project methodology is more successful references:
- Ambysoft Project Success Rate Survey
- Agile Business Intelligence Would Be A Good Idea
- Agile succeeds three times more often than Waterfall
Definition of Project Failure
The definition of project failure is usually a project with a combination of the following issues: significantly over budget, significantly over time, doesn’t deliver full functionality, or is cancelled outright. However, I think it’s more subjective than that, and depends on project context. If a home building project went 20% over budget, or 20% over time, but at the end of the project the house was everything the client wished for… is this a failure? The answer: it depends… For a more in-depth understanding of the subjective, contextual nature of project failure, you can read The rise and fall of Standish Chaos Report figures.