About a year ago there was a lot of buzz about hybrid tables, and even "reverse hybrid tables" or "hybrid tables for poor people".
Since then it has been pretty quiet. A couple of months ago Chris Webb blogged about a new feature called dataCoverageDefinition, and I had the perfect use case for it.
One of our customers wanted to show data from years back but also from today within one KPI (or table), refreshed every 5 minutes via Power BI's automatic page refresh.
A hybrid table, with imported data for yesterday and earlier and the current day's partition in DirectQuery, was the obvious choice.
Our data source was Snowflake, and we had serious performance issues. I expected queries that touched only historical data (the Import mode partitions) to be blazingly fast, and only queries against today's data (which was in DirectQuery) to be slower.
It did not behave like this, however; it was always slow.
It even seemed to be slower than having everything in DirectQuery (I never figured that one out, to be honest).
When I saw Chris' blog post, I thought this might solve our issue, since he wrote that we can now explicitly tell the engine which data the DirectQuery partition covers, so it only goes to the source when that data is actually needed, by defining this new thing called dataCoverageDefinition.
If you want to know more, please read his blog post here.
This is my model/setup (simplified):
The fact table is the one I set up Incremental Refresh for.
The dimension table was in DirectQuery, but I then set it to Dual mode.
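In the scripted model, Dual mode is just the mode property on the dimension's partition. A minimal sketch of what that looks like (the partition name and M expression are placeholders, not my actual model):

{
  "name": "dim-partition",
  "mode": "dual",
  "source": {
    "type": "m",
    "expression": "let Source = ... in Source"
  }
}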
Here is what I did, step by step, to define the dataCoverageDefinition:
1) I defined the Incremental Refresh policy in Power BI Desktop (by right-clicking on my fact table and choosing Incremental Refresh).
Make sure to select "Get the latest data in real time with DirectQuery" in the optional settings, as shown at the bottom of the screenshot below.
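Under the hood, this checkbox puts the table's refresh policy into hybrid mode. A rough sketch of the resulting refreshPolicy in the scripted model, assuming a two-year rolling window at day granularity (the window sizes and sourceExpression are placeholders; yours depend on what you configured):

"refreshPolicy": {
  "policyType": "basic",
  "mode": "hybrid",
  "rollingWindowGranularity": "year",
  "rollingWindowPeriods": 2,
  "incrementalGranularity": "day",
  "incrementalPeriods": 1,
  "sourceExpression": "..."
}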
2) Now upload your report to a Premium workspace (hybrid tables are a Premium feature).
3) Perform a refresh of your semantic model. This will create the partitions we need.
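If you would rather script this refresh over the XMLA endpoint than run it in the service, a TMSL command along these lines should also apply the policy and create the partitions (database and table names are placeholders):

{
  "refresh": {
    "type": "full",
    "applyRefreshPolicy": true,
    "objects": [
      { "database": "MyHybridModel", "table": "fact" }
    ]
  }
}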
4) Open SQL Server Management Studio, connect to the workspace's XMLA endpoint, and log in with your Microsoft account.
5) You will now see a list of databases on the left. Select the Power BI model we just uploaded, right-click on the database icon, and select the following:
6) You should see something like the screenshot below. Your compatibility level will likely be lower than 1603; change it to at least 1603.
Then click on Execute.
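For orientation, the scripted createOrReplace starts roughly like this, and compatibilityLevel is the only value you need to touch (the database name is a placeholder, and the model body is elided):

{
  "createOrReplace": {
    "object": { "database": "MyHybridModel" },
    "database": {
      "name": "MyHybridModel",
      "compatibilityLevel": 1603,
      "model": { ... }
    }
  }
}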
7) Now we can repeat step 5) and we should see the compatibility level at 1603 (or higher).
8) Let us now test our report to see what happens before we change anything.
As you will see, the performance is not what we might expect:
10.05.2024 is in DirectQuery
09.05.2024 is in Import mode
08.05.2024 is in Import mode
07.05.2024 is in Import mode
06.05.2024 is in Import mode
...
Even when querying only data in Import mode, Power BI sends queries to my database (Snowflake), which is not what I want.
How do I know it is sending queries to the database?
Well, I can check in the Snowflake Monitor what is being sent to my source.
Even though I would expect Power BI not to send a query to the database when I only query the Import mode partitions, it does not work like that. This is where dataCoverageDefinition comes into play: this attribute is supposed to make sure we don't send any queries to the database when we load data from our imported partitions.
9) Let us now define the dataCoverageDefinition. Repeat step 5), find your partitions, and go all the way down to the last one. For me, it looked like this before changing anything:
As you can see at the bottom, only the last partition is in DirectQuery mode, which is what I intended when I set up Incremental Refresh.
10) Now I will add this code snippet to that partition to add the dataCoverageDefinition:
"granularity": "day"
},
"dataCoverageDefinition": {
"description": "DQ partition only for TODAY",
"expression": "'fact'[date]=DATE(2024, 5, 10)"
}
}
Make sure to overwrite the "granularity": "day" part so you do not have it in there twice. Once you have pasted in the code, click on "Execute".
You should see something like this as a message that everything worked.
If you have not set the compatibility level high enough, you will get an error at this point saying "unrecognized JSON property datacoveragedefinition".
11) Now we need to do a Process Recalc on the database. Right-click on the database and choose "Process Database".
Then perform a Process Recalc.
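If you prefer scripting it, a Process Recalc corresponds to the TMSL refresh type "calculate" (the database name is a placeholder):

{
  "refresh": {
    "type": "calculate",
    "objects": [
      { "database": "MyHybridModel" }
    ]
  }
}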
12) Let us verify everything is there by opening up our model as we did in step 5)
As you can see, the last partition now has this dataCoverageDefinition attribute, where we defined an expression for what it should be valid for. I know that I fixed it to a single day, which is not ideal, but unfortunately nothing else worked.
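Putting it together, the whole DirectQuery partition now looked roughly like this (the partition name and the policyRange dates are reconstructions, not copied from my model):

{
  "name": "fact-DQ-partition",
  "mode": "directQuery",
  "source": {
    "type": "policyRange",
    "start": "2024-05-10T00:00:00",
    "end": "2024-05-11T00:00:00",
    "granularity": "day"
  },
  "dataCoverageDefinition": {
    "description": "DQ partition only for TODAY",
    "expression": "'fact'[date] = DATE(2024, 5, 10)"
  }
}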
13) Now let's see how this changes our report performance
As you can see, the previous days (since they are in Import mode) load super fast, and only the current day is slow.
I was also able to verify this in the Snowflake monitor: it sent only one query, which was for 10.05.2024, since this date is in DirectQuery.
Up to this point, everything still makes a lot of sense.
14) Next, I want to make the expression dynamic, because I cannot change it manually every day or build some kind of workaround.
I had the data engineers add a column DATUM (which I renamed to "Date as text" in my table) that returns TODAY for the current day and otherwise just the respective date as text. By doing this, I hoped to be able to set the dataCoverageDefinition to TODAY so that it would also work tomorrow, for tomorrow's data.
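As a sketch, the idea was a coverage expression matching that marker value instead of a hard-coded date (the exact expression I tried may have differed slightly):

"dataCoverageDefinition": {
  "description": "DQ partition only for TODAY",
  "expression": "'fact'[Date as text] = \"TODAY\""
}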
By the way, if you forget to do a Process Recalc after adjusting the dataCoverageDefinition, this error is thrown when you open the report:
This is how my partitions looked after I tried to adjust it and did the Process Recalc.
According to the documentation, this should work, but for me it did not.
I also checked the details and ran a query to see what was going on, but the result looked just like with the fixed date (steps 10-12), and no errors were thrown.
At this point I also tried other variations like MAX(date) etc., all of which failed: the report was slow and still sent queries to Snowflake every time.
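For illustration, the variations looked along these lines (sketches, not verbatim copies):

"expression": "'fact'[date] = TODAY()"
"expression": "'fact'[date] = MAX('fact'[date])"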
The only case where the dataCoverageDefinition worked for me was with the date hard-coded (steps 10-12), which was not really helpful, since I would have had to update the attribute every day somehow.
Conclusion
The dataCoverageDefinition is a cool new feature that might finally make those hybrid tables work better. For me, at least, it only functioned when I hard-coded the date. I will be following the discussion on this, and especially what Chris Webb, who helped me along the way, will write about this topic in the future.
By the way, out of practicality we actually went with a pure DirectQuery approach, which was somehow faster than the hybrid table (without a working dataCoverageDefinition).