Persistent Staging Case Study – Payroll Fact in Human Resources Data Warehouse
This post is the first in a series of detailed cases studies discussing the ETL strategies that can be used when a Persistent staging layer is included in your Data Warehouse. It’s intended as a reference for developers using Dimodelo Data Warehouse Studio, our Data Warehouse Automation tool to quickly build a Data Warehouse. It’s most useful for Dimodelo developers who have some initial experience with Dimodelo, either through the Training, or in a commercial application. A general discussion of Persistent Staging can be found in “the top 5 reasons you need a persistent layer in your data warehouse” post.
Overview
The client needed to create a fact that represented Employee Payroll. The Payroll fact contains a row for every pay allowance, for every employee, in every pay period.
The dimensions associated with the fact included:
- Pay Date (role play on Calendar Dimension)
- Pay Period (role play on Calendar Dimension)
- Allowance
- Employee
- Position
- Company
- Reason
The Challenge
The client runs as 2 legal entities. Each legal entity is managed in a different source system. The data is spread across 7 source tables. The client wanted to have a separate fact for each legal entity. Very occasionally pay runs are reversed for some employees. The client only wanted to bring in data from the start of financial year 2017.
The Solution
The general approach was to persistently stage each of the 7 source tables and then join them in the fact transformation. This approach ensures that the Payroll fact table can be reloaded with full history if, at any time the logic changes for calculating measures. This also allows for future advanced predictive analysis (i.e. machine learning etc) over the persistent staging tables, which are keeping all history of change.
Persistent Staging
The primary source of the fact is an Historical Allowance source table, with 1 row per paid allowance. A persistent staging table was created that selected data from a query over this table. The query logic was simple and unlikely to change, so conformed with our recommended scenario 3. The query restricted data to > FY 2017, and used 2 SQL functions available in the source to calculate a Units and Quantity amount. The query also joined to an allowance code table for the type code required by the functions. There was a relatively small amount of data in this table, approx. 2M rows. The Full Extract pattern was used,1.because of the relatively small amount of data and 2. it captures deletes, catering for the occasional pay run deletion.
All the other source tables were extracted as direct 1 to 1 persistent staging tables from source table to persistent staging table with a Full extract. These tables were relatively small. They could have been extracted with an incremental extract (only changed rows extracted from the source) where there was a column in the source that could be used to identify change. The source system was a packaged application, and so, the client was not able to add change tracking to the SQL database. Therefore, the change tracking pattern was not an option.
Payroll Fact
The Fact joins across the 7 source persistent staging tables (as recommended in the guidance). By doing the join in the fact transform, rather than in a View or Materialised View, Dimodelo can generate code that supports a full reload with history of the Fact.
As an aside: If any source table for a Fact or Dimension is not a Persistent Staging table, then Dimodelo can’t rely on the staging table persisting history. Therefore it must generate different code. The generated code can’t rely on the staging layer presenting it with only the delta change set of data, and therefore, does a full compare of the staging source to the target Dimension/Fact to determine the change set during the transform process.
The primary staging table is filtered in the Fact transform to a range of allowance codes. For primary staging tables, Dimodelo generates a sub query filter the primary staging tables prior to joining to any other secondary staging tables.
All secondary staging tables are joined by business key and include the standard temporal join expression, on effective dates. Some secondary staging tables also apply filters. See image below. Filters on secondary staging tables cause Dimodelo to generate an INNER JOIN instead of the standard LEFT JOIN and add the filter expression to the ON Clause. By providing a filter you are telling Dimodelo to do an INNER JOIN. If you want to add additional join logic, but retain a LEFT JOIN, add additional join logic to the Join On box.
The case above is filtering the staging source to just one of the 2 legal entities.
Everything else for the Fact is standard mappings and look-ups.
Because all source staging tables are persistent staging tables. This fact could be dropped or truncated, then reloaded with full history, at any time. The reload would associate the fact rows to the correct version of Dimensions members using the effective date of the Fact row derived from the persistent staging sources.
Dimodelo solutions has in-depth experience in HR Data Warehouses along with many other domains. If you would like help with your Data Warehouse implementation please contact us through our contact page. You can accelerate your Data Warehouse development cycle using our Data Warehouse Automation tool. Download the Dimodelo Data Warehouse Studio free trial and community edition.
Links: