It seems like not a single conversation goes on in our world that doesn’t mention Generative AI ever since the introduction of ChatGPT on November 30, 2022. Numerous such LLMs (Large Language Models) have since emerged and enterprises from every walk of life are either already working with them or looking to as soon as possible. Everyone is convinced that they have to somehow leverage the immense potential of these technologies.
Consider, for instance, the many applications enterprises are finding for LLMs.
These companies, however, face a hurdle: they cannot upload their internal data or their clients’ data to GPT-n or other LLMs, because the data is confidential or cannot be shared for compliance or regulatory reasons. They therefore need a way to achieve their objectives while respecting those constraints.

Though text is omnipresent, most of the data enterprises use for decision making is numerical or categorical in nature: transactions in financial services, demand and supply data in e-commerce and supply chain, telemetry in observability and AIOps, genetic markers, test results, and sensor readings in therapeutics and life sciences, packet flows and malware data in network and security, and location data in geospatial companies. It can be tabular or time-series data. It is typically proprietary and confidential and cannot be shared, especially with LLMs that thrive on training with any data they can get their hands on. Yet despite these stringent restrictions, enterprises still want LLMs to do their job on data that looks and feels like the original; in other words, they want to preserve all the statistical properties and dependencies of the data.

This conundrum can be addressed and solved with synthetic data. Synthetic data is a generative AI approach that can create any number of data points that maintain the fidelity of the original data, while no individual data point is identical to an original one.

The typical enterprise dataset has metadata and measurement fields. Examples of metadata fields are name, gender, address and other demographic data, IP address, server number, and ID. Measurement fields include CPU utilization, credit card transaction amounts, latitude and longitude, and blood pressure readings. Modern synthetic data technologies allow optional privacy transformations on metadata fields, such as masking, redacting, remapping, and randomizing.
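The two ideas above can be sketched in a few lines of Python. This is a minimal illustration only, not Rockfish Data’s actual pipeline: the field names, the helper functions, and the per-column Gaussian model for the measurement field are all assumptions made for the example. Real synthetic data engines model joint distributions and dependencies across columns, not just one column’s marginal statistics.

```python
import hashlib
import random
import statistics

random.seed(7)  # fixed seed so the example is reproducible

# Toy "original" dataset: two metadata fields (name, ip) and
# one measurement field (cpu_util). All values are illustrative.
original = [
    {"name": "Alice", "ip": "10.0.0.1", "cpu_util": 41.2},
    {"name": "Bob",   "ip": "10.0.0.2", "cpu_util": 55.8},
    {"name": "Carol", "ip": "10.0.0.3", "cpu_util": 48.5},
]

def mask(value: str) -> str:
    """Masking: keep the first character, hide the rest."""
    return value[0] + "*" * (len(value) - 1)

def remap(value: str) -> str:
    """Remapping: replace the value with a stable pseudonym."""
    return "user-" + hashlib.sha256(value.encode()).hexdigest()[:8]

def redact(_value: str) -> str:
    """Redacting: drop the value entirely."""
    return "[REDACTED]"

# Fit a simple per-column model of the measurement field
# (here: a normal distribution), then sample fresh values that
# match its mean and spread without repeating any original value.
cpu_values = [row["cpu_util"] for row in original]
mu, sigma = statistics.mean(cpu_values), statistics.stdev(cpu_values)

synthetic = [
    {
        "name": mask(row["name"]),
        "user_id": remap(row["name"]),
        "ip": redact(row["ip"]),
        "cpu_util": round(random.gauss(mu, sigma), 1),
    }
    for row in original
]

for row in synthetic:
    print(row)
```

The resulting rows carry no original identifiers, yet the measurement column is drawn from a distribution fitted to the real data, so downstream code generation or analysis against it behaves much like it would against the original.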
The measurement or attribute fields are, by definition, private. Enterprises can therefore upload the resulting synthetic data to LLMs to generate code or analysis, achieving the intended purpose while adhering to their confidentiality and compliance requirements.

MuckAI Girish is the Co-founder & CEO of Rockfish Data.