Why you need a Persistent Layer in your Data Warehouse – Top 5 Reasons
Why you Need a Persistent Layer
1. All history, all of the time, instead of some history some of the time
Including a persistent layer in your architecture is a paradigm shift in how you see the data warehouse. In the popular Kimball methodology, without the persistent layer, the data warehouse layer was responsible for persistence. This was problematic, because it only recorded some history, for some entities and for some attributes that were the subject of reporting at the time. If later, the history of another attribute was required, that history simply wasn’t available. In addition, if the logic used to calculate an attribute or measure was wrong, then the source data to recalculate that attribute/measure was no longer available.
With a persistent layer, you don’t need to know up front what your business requirements are for history. If, at some time in the future, a user decides they want type 2 changes for a dimension attribute, In Dimodelo Data Warehouse Studio, simply change the attribute to type 2 and re process. The attribute is reloaded with full history. You could do a similar process with a manually developed solution.
In-fact persistent layer gives you the flexibility to completely drop and rebuild your data warehouse layer with full history at any time, or just rebuild one dimension or fact. Your data warehouse becomes a true ‘presentation layer’ that can adapt without fear of data loss. It provides the ability to re-load with full history if required (due to a change in logic, model or mistakes).
2. ETL Performance
Another key element of a persistent layer is performance. There are two (in my opinion) base strategies for improving ETL/ELT throughput. One is only process changed data (i.e. the delta change set), and the other is parallelism. Because a persistent layer is able to identify what changed and when, it is possible to identify the changed set of data. Therefore, all higher layer (e.g. Data Warehouse layer) can be incrementally updated from just the changed set of data. For example, a persistent table may have 500M rows, but on a daily basis the change set might be just 250k rows. This has a big impact on overall performance.
It also means that your extract logic can also be incremental. Because you have a full set of data in the persistent layer, feeding the Dimensional layer, only changes need to be extracted from the source to update the Persistent layer. Dimodelo Data Warehouse Studio supports 3 incremental extract patterns. Without Persistent Tables, it was difficult to do complex transform logic over the data from incremental extracts into ordinary transient staging tables. These staging tables would, on any given day, only have a small subset of rows from the source table/query. It was impossible to join these tables to other tables to get some other representation of the data that needed more than just the subset that was available.
3. Provide new types of accurate time based analysis
A persistent layer will support accurate Time-based insight like Churn. For example, in an Aged Care example, on average how often is the visit schedule changing in the 3 days preceding a home care visit. By reducing the churn of the schedule, an Aged Care provider could improve operational efficiency. This is a case where a report/dashboard might be written to source its data directly from the persistent layer.
Another example I’ve seen is Electricity Network Reliability reporting, where is necessary (from a political/legislative point of view) to accurately record individual electrical outage information, the actions taken over time to rectify, and the outage impact on consumers over that time period.
Once you have a persistent layer, the possibilities for analysis and insights broaden. A persistent layer is also a very good source of data for machine learning algorithms.
4. Evidence !
How often have you been asked “why has my report changed”? Often the answer is that source data has change. A persistent layer gives you evidence you can point to (literally) to explain why a report has changed. Some of our clients are doing this on a daily basis.
5. Auditing
In the age of GDPR, Sarbanes-Oxley etc, being able to provide an accurate history of change in your systems is imperative. A persistent layer provides that history. Only with a bi-temporal database can organizations maintain a complete and accurate picture of the past to understand exactly who knew what and did what, when.