I have decided to write SSIS Slowly Changing Dimension (SCD) Transformation Checklist so you can get most out of it (and I don't forget).
SCD has rather bad reputation and it is accused of being SLOW for larger loads... I say it can be VERY FAST if you now all the tricks... which is not any different from other approaches. Main reason why I like SCD is its simplicity even if it has some weaknesses I often can overcome them but doing some planning and following my processes (Checklists), however bear in mind that there are various ways to achieve the same goals and there might be specific bottlenecks so SCD might not always be the best option but in most cases the difference should not be very big.
NOTE: Feel free to suggest new checklist points.
SSIS SCD Transformation Checklist:
Below is the checklist organized by subject.
SCD is very specific and has its limitations. Main challenge is performance for larger dimension loads. In order to get very good performance for large loads you will need Clustered Index on Dimension and the column must be Business Key. Some of you will straight away question this approach so before that happens it might worth reading TechNet Data Warehouse Query Performance and you can post your thoughts on this subject in this MSDN Forum thread.
- Create Clustered Index on Dimension table using Business Key as column
- You might be able to use an index hint to use non clustered index (I need to verify that)
- Sort Input by Business Key (use the same order as Clustered Index Column)
- This is mainly to improve inserts/updates, probably sorting it before SCD might also improve performance (caching) but I need to verify that.
- SCD will issue a query to the database for each row so it is important to see if this kind of operation is not a bottleneck, you can easily check that by doing SCD on an empty table.
SCD Type 2
Another common challenge is how to handle SCD Type 2 (Historical) without making additional changes after SCD Wizard generates output and here is what I recommend (which might not be what you recommend so feel free to post comments).
- Create column in Dimension Table IsCurrent as BIT
- Use IsCurrent in SCD Wizard to handle Type 2
- Start Date
- Use Default Value on Start Date Field (Date Time)
- If you use Business DateTime than add to Data Flow.
- End Date
- Use View and calculate dynamically.
- Start & End Date. For more info visit Jamie Thompson post Debunking Kimball Effective Dates
Rapidly Changing Dimensions
For Rapidly Changing Dimension (Mainly Type 2) I suggest to look into using Kimball Mini Dimension.
This is part is still work in progress but on high level here are some tips.
- For updates consider storing data in a separate table and performing single update. (I still need to do some test to see if updates can be improved without additional complication of introducing new table).
- Potentially it might be possible to do lookup before SCD and use Hash function so we can detect rows that haven't changed and ignore them before they go to SCD component. This require extra logic and maintenance so I still need to test actual Performance gain. The same might be applicable to new rows (Business Key doesn't exist).
- A made some tests and doing LKP to detect new rows helps but difference in performance between SCD needs to be calculated to see if it is worth it.
- hashing columns should also help but it increases complexity so probably at this point looking at alternative approaches might be a better way to do it.
- Also hashbytes function can provide different result for different data types / collation so this method increases risks.
- For new rows it is faster to put Sorting Transformation (Business Key, sorting according to Clustered Index).
- Change OLEDB Destination to Fast Load and Commit size 0 (one transformation).
- Because we have one transformation we need Sort Transformation (Blocking Component) to ensure update finish before fast load with commit size 0 is executed. (I need to do more testing on that)
- In OLEDB Destination change Load Option and add Order(BusinessKeyField ASC)
SCD is simple; if you plan ahead and it can perform very well if you remember about several performance tips, SCD is not ideal for all projects but I see it working for in the following scenarios:
- Proof of concepts (No SSIS Framework)
- "New to BI" Team who needs to complete a project quickly and accurately.
- What about larger projects? In these metadata driven SCD is probably much more appropriate.
For any other scenarios multiple options can be considered.