Top Privacy Concerns in Adopting GenAI & Practical Tips to Mitigate Them
ChatGPT is powerful but your data is sensitive & critical.
Written By: Chakravarthy Varaga
"Despite the enthusiasm, enterprises are slow to adopt commercial LLMs — like GPT provided by OpenAI — as they share several concerns. In fact, less than a 1/4th of surveyed companies are comfortable using commercial LLMs in production.At a high level, data privacy concerns top the list. In our discussions, nearly 40% of companies voiced concerns about sharing proprietary or sensitive data with LLM vendors"
Survey conducted by Predibase with over 150 CXOs/leaders involved in adopting GenAI in their organisations.
From my discussions with multiple Heads/Data Scientists across companies, one approach that everyone mentioned is to fine-tune open-source models such as Llama2 and host them in private infrastructure. This can help avoid sharing sensitive data with public vendors like OpenAI/ChatGPT/Mistral/Anthropic.
However, in the adverse event that a malicious actor gains access to your model weights, they could extract your organisation's sensitive data from the model.
Your Generative AI model is an asset. Treat it like one.
Practical Tips & Necessary GuardRails to Mitigate Privacy Concerns
Zero-Data-Loss and Data Anonymisation
Personal, financial, health-related or otherwise sensitive data must not be fed into:
- Prompts
- ChatGPT, OpenAI/Public LLM Endpoints
- Internal LLM Models
- APIs & 3rd Party Systems
Ensure such data is anonymised to protect individual identifiers such as name, age, email, phone number, personal IDs, health information, credit card numbers, expiry dates and CVVs, as well as organisation-sensitive information such as revenue numbers, business strategies, financials and brand values.
- Techniques such as data masking can prevent the disclosure of personal information.
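As a rough illustration, here is a minimal Python sketch of regex-based masking; the `mask_pii` helper and its patterns are assumptions invented for this example, and a production system would need far broader detection (names, addresses, national IDs), typically via a dedicated PII-detection model.

```python
import re

# Illustrative patterns only -- real PII detection needs broader coverage
# (names, addresses, national IDs) and usually a dedicated NER model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,14}\d\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def mask_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before it reaches a prompt."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

prompt = "Contact Jane at jane.doe@example.com or +65 9123 4567."
print(mask_pii(prompt))
# Contact Jane at [EMAIL] or [PHONE].
```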
Synthetic Data
It is possible to extract training data, including sensitive confidential information, from a pre-trained model using simple attack vectors.
Use synthetic data (pseudonymisation) as a replacement for real sensitive data when fine-tuning an LLM. This way, even if your data is extracted, it does not leak confidential information.
- Ensure that the synthetic replacement data is contextually relevant, personalised and consistent with your business context (e.g. if you sell products in Singapore, use products and brands relevant to the region).
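A minimal sketch of this replacement step, assuming the open-source Faker library (any synthetic-data generator would do); the `synthesise_record` helper and its field names are illustrative, not a prescribed schema.

```python
from faker import Faker  # pip install faker

fake = Faker()
Faker.seed(42)  # reproducible fake records across fine-tuning runs

def synthesise_record(real_record: dict) -> dict:
    """Swap sensitive fields for coherent fake values while keeping the
    business context (product, region) intact for fine-tuning."""
    return {
        **real_record,
        "name": fake.name(),
        "email": fake.email(),
        "credit_card": fake.credit_card_number(),
    }

real = {
    "name": "Jane Doe", "email": "jane@corp.example",
    "credit_card": "4111111111111111",
    "product": "NoiseCancel X2", "region": "Singapore",
}
print(synthesise_record(real))  # same shape, no real identities
```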
Data Moderation with Confidential Terms Detection
Every business has a set of key, valuable terms that are sensitive and proprietary to the organisation. It could be your marketing strategy, revenue numbers or high-premium customer segments.
- Your chat prompts and the model endpoint need a moderation layer that detects these sensitive terms and blocks them through pre-defined policies.
This privacy layer provides single-pass identification of sensitive/personal data, redaction, and replacement with fake data that is contextually relevant and coherent. The responses from the LLM also have to be moderated to detect malicious code.
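As a sketch of what such a layer could look like: the policy table, term patterns and block/redact actions below are hypothetical examples, and a real deployment would load policies from a central policy store rather than hard-coding them.

```python
import re

# Hypothetical policies: a confidential-term pattern mapped to an action.
POLICIES = [
    (re.compile(r"\bQ[1-4]\s+revenue\b", re.IGNORECASE), "block"),
    (re.compile(r"\bpremium\s+segment\b", re.IGNORECASE), "redact"),
]

def moderate(prompt: str) -> str:
    """Apply pre-defined policies to a prompt before it reaches the LLM."""
    for pattern, action in POLICIES:
        if pattern.search(prompt):
            if action == "block":
                raise PermissionError(f"Blocked by policy: {pattern.pattern}")
            prompt = pattern.sub("[REDACTED]", prompt)
    return prompt

print(moderate("Summarise churn in our premium segment"))
# Summarise churn in our [REDACTED]

try:
    moderate("What was Q3 revenue?")
except PermissionError as err:
    print(err)  # Blocked by policy: \bQ[1-4]\s+revenue\b
```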
Data & Model Governance
Implement strict access controls through a governance framework to limit who can input data into the GenAI system and who can access the outputs. Ensuring that only authorised personnel have access can significantly reduce the risk of data breaches.
- The moderation layer can be made intelligent enough to incorporate your organisation's authorisations and privileges for accessing resources.
- Treat your model & the necessary data that goes in and out of the model as another set of resources.
- For example, this could be your HR policies, which are accessible only to a certain grade and above. Compensation for a grade like 'G6' cannot be made available to any grade below 'G5'.
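A minimal sketch of such a grade-aware check in front of a model endpoint; the grade ordering, resource names and `can_access` helper are assumptions invented for this example.

```python
# Grades in ascending order of seniority -- an illustrative assumption.
GRADE_ORDER = ["G1", "G2", "G3", "G4", "G5", "G6"]

# Minimum grade required to query the model about each resource.
RESOURCE_POLICY = {
    "hr_policies": "G4",
    "g6_compensation": "G5",  # nothing below G5, per the example above
}

def can_access(user_grade: str, resource: str) -> bool:
    """Return True if the user's grade meets the resource's minimum grade."""
    required = RESOURCE_POLICY[resource]
    return GRADE_ORDER.index(user_grade) >= GRADE_ORDER.index(required)

assert can_access("G5", "g6_compensation")      # allowed
assert not can_access("G4", "g6_compensation")  # blocked by the privacy layer
```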
Privacy-by-Design
Adopt a privacy-by-design approach at every stage of the development process, from initial design, model design and pre-training/fine-tuning through deployment, inference and GA access, ensuring that privacy protection is baked into the technology.
Centralised Inventory, Catalog of GenAI Implementations
Have a repo of all GenAI implementations that tracks
- the models and their checkpoints
- datasets used
- productionised versions and their purposes (explainability)
This repository and catalog listing improves transparency & explainability of your models & their usage across the organisation.
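As a sketch, a catalog entry can start as a simple record type; the `GenAIImplementation` fields below are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class GenAIImplementation:
    """One catalog entry tracking a model, its data and its purpose."""
    name: str
    model: str                              # e.g. "llama2-7b"
    checkpoint: str                         # e.g. "ckpt-1200"
    datasets: list[str] = field(default_factory=list)
    production_version: str = ""
    purpose: str = ""                       # aids explainability

CATALOG: list[GenAIImplementation] = [
    GenAIImplementation(
        name="support-assistant",
        model="llama2-7b",
        checkpoint="ckpt-1200",
        datasets=["tickets-2023-synthetic"],
        production_version="v1.3",
        purpose="Answer tier-1 support queries from anonymised ticket history",
    )
]
```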
Regular Audits and Compliance Checks
Conduct regular audits of your GenAI systems to ensure they comply with data protection laws such as GDPR, CCPA, DPDPA or any other relevant legislation.
- This includes reviewing data handling practices and the model's outputs for any potential privacy violations.
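Parts of such an audit can be automated. As a hedged sketch, a recurring job could scan a sample of logged model outputs for residual personal data; the log format and the single email pattern below are assumptions for illustration.

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")  # one illustrative check

def audit_outputs(logged_outputs: list[str]) -> list[tuple[int, str]]:
    """Return (index, output) pairs flagged as potential privacy violations."""
    return [(i, out) for i, out in enumerate(logged_outputs) if EMAIL.search(out)]

sample = ["Your ticket is resolved.", "Reach the owner at jane@corp.example"]
for idx, violation in audit_outputs(sample):
    print(f"output #{idx} flagged for review: {violation!r}")
```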
Transparency and Consent
Be transparent and explainable with your stakeholders about the use of GenAI technologies and the data they process.
- Obtain explicit consent from individuals whose data may be used, clearly explaining how their data will be handled and for what purposes.
- Keep your "Data Principles" informed of the data used. These are your customers/users of your services.
- Keep your Compliance & Risk Officer in the loop for all sources of information, including the data on which you train the model.
We are building a customisable privacy layer with the necessary guardrails and policy-based, configurable detection and moderation capabilities needed to secure business adoption of GenAI.