Top 5 reasons you need a Persistent Layer in your Data Warehouse
At Dimodelo, we have been hard at work, redesigning the Data Warehouse data management architecture that Dimodelo Data Warehouse Studio (our Data Warehouse Automation tool) generates.
Working with our clients we have introduced a Persistent Layer into the data warehouse architecture.
“What is a Persistent Layer?” I hear you ask.
The persistent layer contains as set of persistent tables that record the full history of changes to the data of the table/query that is the source of the Persistent table.
The source could a source table/file, a source query, another staging table or a view/materialized view in the transform layer.
In a persistent table there may be multiple row versions for each row found in the source. Each row version has an effective date and end date marking the date range of when that row version was effective (or in existence).
Technically speaking a persistent table is a bi-temporal table. A bi-temporal table permit queries over two timelines: valid time and transaction time. Valid time is the time when a row is effective (i.e row effective and end date-time). Transaction time denotes the time when the row version was recorded in the database. The persistent table supports transaction time by tagging each row version with an inserted and updated batch execution Id. The batch execution is associated with a date time in the batch database. Note that the Dimodelo Data Warehouse Studio version of the bi-temporal table goes one step further by identifying a last updated transaction date time also.
How is a Persistent layer different to an ODS, Inmon EDW, Data Vault?
A Persistent Layer has some similarities to all of these techniques, in that, it is a precursor to a Dimensional layer. It could be described as a temporal ODS. However there are some key differences:
- The temporal nature of a Persistent layer is different. A Data Vault will contain some temporal information, but not effective dates. The other techniques don’t traditionally hold history.
- An ODS and EDW, and to a certain extent, a Data Vault, are designed to be queried by end users. A Persistent layer’s primary purpose is to support the higher layers in the Data Warehouse. You could query a Persistent layer directly, but, because of the temporal nature of the data, it becomes difficult to write the temporal joins. Luckily Dimodelo also set a Is_Latest flag for every row version, so it’s easy to query current state,and thus emulate an ODS.
- The extent to which these layers are modeled is different. The EDW and Data Vault techniques expect to be extensively modeled. This takes considerable effort on the behalf of the developer, first to model the layer, and second to transform the data from the source into the new model. In a persistent layer, you can model this layer as much or as little as you like. Again its purpose is support higher layers. If you need, you can create entities representing aggregations, allocations and new relationships etc in this layer. You might do this to support the Dimensional layer, or for a specific reporting purpose. We do recommend that the persistent layer be at least organised, into source systems, or entity domains, or both.
Why you Need a Persistent Layer
1. All history, all of the time, instead of some history some of the time
Including a persistent layer in your architecture is a paradigm shift in how you see the data warehouse. In the popular Kimball methodology, without the persistent layer, the data warehouse layer was responsible for persistence. This was problematic, because it only recorded some history, for some entities and for some attributes that were the subject of reporting at the time. If later, the history of another attribute was required, that history simply wasn’t available. In addition, if the logic used to calculate an attribute or measure was wrong, then the source data to recalculate that attribute/measure was no longer available.
With a persistent layer, you don’t need to know up front what your business requirements are for history. If, at some time in the future, a user decides they want type 2 changes for a dimension attribute, In Dimodelo Data Warehouse Studio, simply change the attribute to type 2 and re process. The attribute is reloaded with full history. You could do a similar process with a manually developed solution.
In-fact persistent layer gives you the flexibility to completely drop and rebuild your data warehouse layer with full history at any time, or just rebuild one dimension or fact. Your data warehouse becomes a true ‘presentation layer’ that can adapt without fear of data loss. It provides the ability to re-load with full history if required (due to a change in logic, model or mistakes).
2. ETL Performance
Another key element of a persistent layer is performance. There are two (in my opinion) base strategies for improving ETL/ELT throughput. One is only process changed data (i.e. the delta change set), and the other is parallelism. Because a persistent layer is able to identify what changed and when, it is possible to identify the changed set of data. Therefore, all higher layer (e.g. Data Warehouse layer) can be incrementally updated from just the changed set of data. For example, a persistent table may have 500M rows, but on a daily basis the change set might be just 250k rows. This has a big impact on overall performance.
It also means that your extract logic can also be incremental. Because you have a full set of data in the persistent layer, feeding the Dimensional layer, only changes need to be extracted from the source to update the Persistent layer. Dimodelo Data Warehouse Studio supports 3 incremental extract patterns. Without Persistent Tables, it was difficult to do complex transform logic over the data from incremental extracts into ordinary transient staging tables. These staging tables would, on any given day, only have a small subset of rows from the source table/query. It was impossible to join these tables to other tables to get some other representation of the data that needed more than just the subset that was available.
3. Provide new types of accurate time based analysis
A persistent layer will support accurate Time-based insight like Churn. For example, in an Aged Care example, on average how often is the visit schedule changing in the 3 days preceding a home care visit. By reducing the churn of the schedule, an Aged Care provider could improve operational efficiency. This is a case where a report/dashboard might be written to source its data directly from the persistent layer.
Another example I’ve seen is Electricity Network Reliability reporting, where is necessary (from a political/legislative point of view) to accurately record individual electrical outage information, the actions taken over time to rectify, and the outage impact on consumers over that time period.
Once you have a persistent layer, the possibilities for analysis and insights broaden. A persistent layer is also a very good source of data for machine learning algorithms.
4. Evidence !
How often have you been asked “why has my report changed”? Often the answer is that source data has change. A persistent layer gives you evidence you can point to (literally) to explain why a report has changed. Some of our clients are doing this on a daily basis.
5. Auditing
In the age of GDPR, Sarbanes-Oxley etc, being able to provide an accurate history of change in your systems is imperative. A persistent layer provides that history. Only with a bi-temporal database can organizations maintain a complete and accurate picture of the past to understand exactly who knew what and did what, when.