Inherent Data Leakage in Microsoft Fabric Business-Led Development
Microsoft Fabric is an end-to-end analytics and data platform that covers a wide range of functionality, including data movement, processing, ingestion, transformation, real-time event routing, and report building. The platform allows business users of all technical backgrounds to create, process, and store data and build powerful business tools from a unified platform.
Unfortunately, Fabric users can also build reports on top of semantic models that store more information than the report needs, and that extra data can easily be extracted using basic prompt injection via the Q&A functionality.
Zenity researchers have uncovered an inherent flaw in Microsoft Fabric that causes data leakage from easy-to-create and easy-to-use reports. We reported these findings to Microsoft last month (May 2024), but were told that the behavior is by design. The questions we want security teams to consider in this instance are:
- How does this happen?
- What can I do about it?
Understanding the Context
Before diving in, it is important to note that ‘semantic models’ and ‘reports’ are far and away the most commonly created assets in Fabric, and the synergy between these objects is fundamental to how Fabric works.
Semantic models are essentially the underlying data sets that reports draw on. It’s very common for builders to configure a report to pull only a fraction of the semantic model’s data in order to fulfill a certain task or objective. People who interact with these reports can then use the native question & answer (Q&A) functionality to easily process and analyze the data. The Q&A feature is widely adopted within Fabric as a way to introduce more LLM capabilities and, as is often the case, it also introduces a lot of risk that security teams need to reckon with.
Now that we understand that, let’s show where it all goes wrong…
Reproduction Step: Create a Semantic Model
Let’s take a scenario where a marketing team wants to analyze customer acquisition data to run more timely marketing campaigns. First, anyone on the marketing team can use Fabric to create a semantic model, or a logical description of an analytical domain that includes metrics and data using natural terminology that can be easily understood and analyzed. In the example below, we see a semantic model that contains customer information, namely, the main contact, their contact information, the deal size, and the start/end date of each individual customer.
Create a Report
Next, still using Fabric, this marketing professional can then build a sample report that uses the data from the semantic model to help members of the marketing team pull relevant data about customers to analyze it.
However, some of this data, namely the deal size, is not relevant to this analysis, so the marketing lead excludes it from the report. This information is also proprietary and must not reach the general public, because the organization is publicly traded and has not yet released its latest quarterly earnings report. The report is therefore built to exclude any data that speaks to overall revenue.
To let the marketing team interact with this data, the marketing lead then adds Q&A functionality to the report. People can ask questions such as ‘who is our newest customer?’ and ‘when is the most common month for customers to sign contracts?’ and gain these insights quickly and efficiently.
Share the Report
Next, to get this report out in the wild and ready to use, the report builder goes to ‘Manage Permissions’ and either shares the report directly with other users or provides a shareable link, much like someone would share a document. They then set the permission to ‘Read’, which is meant to restrict the data and ensure that people who interact with the report see only the data available in the report (or so it should go…).
Testing the Report’s Constraints
Breaking the fourth wall now, let’s go in and see if this holds up. Using the link that was sent to us, or the ‘Browse’ option in Fabric, let’s open up the report and view the semantic model.
Now, we can observe that underneath the Tables, we can view the metadata structure, even though the actual data is, or should be, obscured from our view.
When we compare the semantic model columns against the columns in the report, it’s plain to see that the ‘Deal Size’ column was excluded from the report… but still exists in the semantic model where we can see it in plain view.
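The comparison described above boils down to a set difference between the model’s columns and the report’s columns. The following is a minimal sketch of that check in plain Python. The column names are assumptions based on the example in this post, typed in by hand rather than read from Fabric, and `hidden_columns` is a hypothetical helper, not a Fabric API.

```python
# Column names are illustrative assumptions taken from the example scenario;
# nothing here is read from a real Fabric workspace.
semantic_model_columns = {
    "Main Contact",
    "Contact Email",
    "Deal Size",
    "Start Date",
    "End Date",
}
report_columns = {"Main Contact", "Contact Email", "Start Date", "End Date"}


def hidden_columns(model_cols, report_cols):
    """Return columns that exist in the model but were left out of the report."""
    return sorted(set(model_cols) - set(report_cols))


excluded = hidden_columns(semantic_model_columns, report_columns)
print(excluded)  # ['Deal Size']
```

Any column that shows up in this difference is one the report builder chose to leave out, which makes it a candidate for exactly the kind of exposure tested in the next step.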
This makes sense, as the underlying semantic model is the central hub for all things customer data, but the report we’re building doesn’t need that information. Let’s go back to our situation and play it out.
Using the Q&A to Leak Data
Let’s put ourselves in the position of someone on this marketing team who needs to answer a manager’s question about when customers are most frequently acquired. Even for a trusted insider, deal size is not pertinent to this task. But a marketer might reasonably want to look at trends among the biggest customers, and so uses the Q&A functionality inside the report to ask “What is our biggest customer?”
Knowing that the report was set up to directly exclude this data, we should expect to get some sort of failure message, but in reality, we saw anything but.
The results above show that this prompt queried the ‘Deal Size’ column even though it is not reflected in the report, and returned an accurate response with the specific value for the biggest customer. At best, this violates the principle of least privilege. At worst, it leads to sensitive corporate data being exfiltrated, resulting in data loss and/or compliance failure.
How Does This Happen? And Why?
When a report is shared in Fabric, the semantic model is by default shared with anyone who has access to the report. The Q&A functionality inherently bypasses the report’s data restriction and uses the full scope of the semantic model as its source of truth. Users can then ‘innocently’ ask questions that prompt the app to query the data source, and in doing so gain access to data that was excluded from the report. Other users, whose intentions may be less benign, can see this data and use it to their advantage.
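To make the mechanism concrete, here is a toy model of the behavior described above, written in plain Python. It is a sketch under stated assumptions, not Fabric’s implementation: `SemanticModel`, `Report`, `ask_largest`, and `ask_largest_scoped` are hypothetical names, and the customer rows are made-up sample data. The point is only that a report’s column selection limits what the visuals show, while a question answered against the whole model is not limited by it.

```python
from dataclasses import dataclass


@dataclass
class SemanticModel:
    rows: list  # each row is a dict of column name -> value


@dataclass
class Report:
    model: SemanticModel
    visible_columns: tuple

    def view(self):
        """What the report visuals show: only the selected columns."""
        return [
            {col: row[col] for col in self.visible_columns}
            for row in self.model.rows
        ]

    def ask_largest(self, column):
        """Toy stand-in for Q&A: answers from the full model, ignoring visible_columns."""
        return max(self.model.rows, key=lambda row: row[column])

    def ask_largest_scoped(self, column):
        """What one might expect instead: refuse columns outside the report."""
        if column not in self.visible_columns:
            raise PermissionError(f"{column!r} is not part of this report")
        return self.ask_largest(column)


# Made-up sample data for illustration only.
model = SemanticModel(rows=[
    {"Customer": "Contoso", "Start Date": "2024-01-15", "Deal Size": 120_000},
    {"Customer": "Fabrikam", "Start Date": "2024-03-02", "Deal Size": 450_000},
])
report = Report(model=model, visible_columns=("Customer", "Start Date"))

answer = report.ask_largest("Deal Size")
print(answer["Customer"], answer["Deal Size"])  # Fabrikam 450000
```

In this sketch, `report.view()` never contains ‘Deal Size’, yet `ask_largest("Deal Size")` still returns it. The scoped variant raises `PermissionError` for the same question, which is the kind of failure a report builder would likely expect.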
If the report is shared via a link, it is by default shared with the entire organization, which makes this scenario much worse. Further, because a semantic model can serve multiple reports, it’s very common to end up with a ‘supply chain’ of reports in which this over-sharing recurs and violations accumulate.
Not only is this vulnerability easy to introduce, since it is the default setting, but there is also a scale issue at play. As more and more people use these types of tools to automate processes and analyze data, there are more reports, models, and apps that have access to underlying sensitive data. Because no SDLC or CI/CD pipeline tools are involved in the development of these reports and semantic models, there’s no natural checkpoint for spotting this data leakage in Microsoft Fabric until it’s too late. And given that ‘you’re only as strong as your weakest link,’ there are many chances for a failure that results in data leakage.
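As a rough illustration of the kind of check that is missing here, the sketch below triages a hand-built inventory of reports. It flags any report that has Q&A enabled while its semantic model holds columns the report leaves out, and lists organization-wide links first. Everything in it is an assumption made for illustration: `ReportRecord`, `triage`, the field names, and the sample records are hypothetical, and collecting such an inventory from a real tenant is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportRecord:
    name: str
    model_columns: frozenset   # columns in the underlying semantic model
    report_columns: frozenset  # columns actually used by the report
    has_qna: bool              # report includes a Q&A object
    org_wide_link: bool        # shared via an organization-wide link


def triage(records):
    """Return (name, hidden columns, org_wide_link) for reports at risk."""
    findings = []
    for rec in records:
        hidden = rec.model_columns - rec.report_columns
        if rec.has_qna and hidden:
            findings.append((rec.name, sorted(hidden), rec.org_wide_link))
    # Organization-wide links first, then alphabetical by report name.
    findings.sort(key=lambda finding: (not finding[2], finding[0]))
    return findings


# Hypothetical inventory; names and columns are sample values only.
model_cols = frozenset(
    {"Main Contact", "Contact Email", "Deal Size", "Start Date", "End Date"}
)
records = [
    ReportRecord("Start Dates", model_cols,
                 frozenset({"Main Contact", "Start Date"}), True, False),
    ReportRecord("Renewals", model_cols, model_cols, True, False),
    ReportRecord("Contact List", model_cols,
                 frozenset({"Main Contact"}), False, False),
    ReportRecord("Acquisition Trends", model_cols,
                 model_cols - {"Deal Size"}, True, True),
]

for name, hidden, org_wide in triage(records):
    print(name, hidden, "org-wide" if org_wide else "direct share")
```

With this sample inventory, ‘Acquisition Trends’ is listed first because it combines Q&A, a hidden ‘Deal Size’ column, and an organization-wide link. ‘Renewals’ and ‘Contact List’ are not flagged, since the first hides nothing and the second has no Q&A object.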
Data Leakage in Microsoft Fabric. What Can I Do About This?
As of today, the answers are frustrating. The two suggested fixes are to either remove the Q&A objects from reports or scope down the data set. However, both inherently slow business operations and stifle innovation, which is the very purpose of low-code tools like Microsoft Fabric (Power BI). The real answer is to first gain an understanding of all the different flows, models, and resources that are developed using Fabric, and then understand the business logic behind each one: why it was developed and for what purpose. Only then can security leaders get to the root of the issue, start remediating existing blind spots and vulnerabilities, and stop data leakage stemming from Microsoft Fabric. If you’d like to chat about this issue, feel free to contact us, or sign up for our newsletter to stay in the loop on security issues stemming from low-code and copilot-led development.