In this article:

This is some text inside of a div block.

Share the Article

How Sentra Accurately Classifies Sensitive Data at Scale

July 30, 2024

Min Read

Data Security

Hanan Zaichyk

Data Scientist

Background on Classifying Different Types of Data

It’s first helpful to review the primary types of data we need to classify - Structured and Unstructured Data and some of the historical challenges associated with analyzing and accurately classifying it.

What Is Structured Data?

Structured data has a standardized format that makes it easily accessible for both software and humans. Typically organized in tables with rows and/or columns, structured data allows for efficient data processing and insights. For instance, a customer data table with columns for name, address, customer-ID and phone number can quickly reveal the total number of customers and their most common localities.

‍

Moreover, it is easier to conclude that the number under the phone number column is a phone number, while the number under the ID is a customer-ID. This contrasts with unstructured data, in which the context of each word is not straightforward.

What Is Unstructured Data?

Unstructured data, on the other hand, refers to information that is not organized according to a preset model or schema, making it unsuitable for traditional relational databases (RDBMS). This type of data constitutes over 80% of all enterprise data, and 95% of businesses prioritize its management. The volume of unstructured data is growing rapidly, outpacing the growth rate of structured databases.

‍

Examples of unstructured data include:

‍

Various business documents
Text and multimedia files
Email messages
Videos and photos
Webpages
Audio files

While unstructured data stores contain valuable information that often is essential to the business and can guide business decisions, unstructured data classification has historically been challenging. However, AI and machine learning have led to better methods to understand the data content and uncover embedded sensitive data within them.

‍

The division to structured and unstructured is not always a clear cut. For example, an unstructured object like a docx document can contain a table, while each structured data table can contain cells with a lot of text which on its own is unstructured. Moreover there are cases of semi-structured data. All of these considerations are part of Sentra’s data classification tool and beyond the scope of this blog.

Data Classification Methods & Models

Applying the right data classification method is crucial for achieving optimal performance and meeting specific business needs. Sentra employs a versatile decision framework that automatically leverages different classification models depending on the nature of the data and the requirements of the task.

‍

We utilize two primary approaches:

Rule-Based Systems
Large Language Models (LLMs)

Rule-Based Systems

Rule-based systems are employed when the data contains entities that follow specific, predictable patterns, such as email addresses or checksum-validated numbers. This method is advantageous due to its fast computation, deterministic outcomes, and simplicity, often providing the most accurate results for well-defined scenarios.

Due to their simplicity, efficiency, and deterministic nature, Sentra uses rule-based models whenever possible for data classification. These models are particularly effective in structured data environments, which possess invaluable characteristics such as inherent structure and repetitiveness.

‍

For instance, a table named "Transactions" with a column labeled "Credit Card Number" allows for straightforward logic to achieve high accuracy in determining that the document contains credit card numbers. Similarly, the uniformity in column values can help classify a column named "Abbreviations" as 'Country Name Abbreviations' if all values correspond to country codes.

‍

Sentra also uses rule-based labeling for document and entity detection in simple cases, where document properties provide enough information. Customer-specific rules and simple patterns with strong correlations to certain labels are also handled efficiently by rule-based models.

Large Language Models (LLMs)

Large Language Models (LLMs) such as BERT, GPT, and LLaMa represent significant advancements in natural language processing, each with distinct strengths and applications. BERT (Bidirectional Encoder Representations from Transformers) is designed for fine-grained understanding of text by processing it bidirectionally, making it highly effective for tasks like Named Entity Recognition (NER) when trained on large, labeled datasets.

‍

In contrast, autoregressive models like the famous GPT (Generative Pre-trained Transformer) and Llama (Large Language Model Meta AI) excel in generating and understanding text with minimal additional training. These models leverage extensive pre-training on diverse data to perform new tasks in a few-shot or zero-shot manner. Their rich contextual understanding, ability to follow instructions, and generalization capabilities allow them to handle tasks with less dependency on large labeled datasets, making them versatile and powerful tools in the field of NLP. However, their great value comes with a cost of computational power, so they should be used with care and only when necessary.

‍

Applications of LLMs at Sentra

‍

Sentra uses LLMs for both Named Entity Recognition (NER) and document labeling tasks. The input to the models is similar, with minor adjustments, and the output varies depending on the task:

‍

Named Entity Recognition (NER): The model labels each word or sentence in the text with its correct entity (which Sentra refers to as a data class).
Document Labels: The model labels the entire text with the appropriate label (which Sentra refers to as a data context).
Continuous Automatic Analysis: Sentra uses its LLMs to continuously analyze customer data, help our analysts find potential mistakes, and to suggest new entities and document labels to be added to our classification system.

‍

*Here you can see an example of how Sentra classifies personal information.*
***Note***: Entity refers to data classes on our dashboard
Document labels refers to data context on our dashboard

‍

Sentra’s Generative LLM Inference Approaches

An inference approach in the context of machine learning involves using a trained model to make predictions or decisions based on new data. This is crucial for practical applications where we need to classify or analyze data that wasn't part of the original training set.

‍

When working with complex or unstructured data, it's crucial to have effective methods for interpreting and classifying the information. Sentra employs Generative LLMs for classifying complex or unstructured data. Sentra’s main approaches to generative LLM inference are as follows:

Supervised Trained Models (e.g., BERT)

In-house trained models are used when there is a need for high precision in recognizing domain-specific entities and sufficient relevant data is available for training. These models offer customization to capture the subtle nuances of specific datasets, enhancing accuracy for specialized entity types. These models are transformer-based deep neural networks with a “classic” fixed-size input and a well-defined output size, in contrast to generative models. Sentra uses the BERT architecture, modified and trained on our in-house labeled data, to create a model well-suited for classifying specific data types.

‍

This approach is advantageous because:

‍

In multi-category classification, where a model needs to classify an object into one of many possible categories, the model outputs a vector the size of the number of categories, n. For example, when classifying a text document into categories like ["Financial," "Sports," "Politics," "Science," "None of the above"], the output vector will be of size n=5. Each coordinate of the output vector represents one of the categories, and the model's output can be interpreted as the likelihood of the input falling into one of these categories.
The BERT model is well-designed for fine-tuning specific classification tasks. Changing or adding computation layers is straightforward and effective.
The model size is relatively small, with around 110 million parameters requiring less than 500MB of memory, making it both possible to fine-tune the model’s weights for a wide range of tasks, and more importantly - run in production at small computation costs.
It has proven state-of-the-art performance on various NLP tasks like GLUE (General Language Understanding Evaluation), and Sentra’s experience with this model shows excellent results.

Zero-Shot Classification

One of the key techniques that Sentra has recently started to utilize is zero-shot classification, which excels in interpreting and classifying data without needing pre-trained models. This approach allows Sentra to efficiently and precisely understand the contents of various documents, ensuring high accuracy in identifying sensitive information.

‍

The comprehensive understanding of English (and almost any language) enables us to classify objects customized to a customer's needs without creating a labeled data set. This not only saves time by eliminating the need for repetitive training but also proves crucial in situations where defining specific cases for detection is challenging. When handling sensitive or rare data, this zero-shot and few-shot capability is a significant advantage.

‍

Our use of zero-shot classification within LLMs significantly enhances our data analysis capabilities. By leveraging this method, we achieve an accuracy rate with a false positive rate as low as three to five percent, eliminating the need for extensive pre-training.

Sentra’s Data Sensitivity Estimation Methodologies

Accurate classification is only a (very crucial) step to determine if a document is sensitive. At the end of the day, a customer must be able to also discern whether a document contains the addresses, phone numbers or emails of the company’s offices, or the company’s clients.

Accumulated Knowledge

Sentra has developed domain expertise to predict which objects are generally considered more sensitive. For example, documents with login information are more sensitive compared to documents containing random names.

Sentra has developed the main expertise based on our collected AI analysis over time.

‍

How does Sentra accumulate the knowledge? (is it via AI/ML?)

‍

Sentra accumulates knowledge both from combining insights from our experience with current customers and their needs with machine learning models that continuously improve based on the data they are trained with over time.

Customer-Specific Needs

Sentra tailors sensitivity models to each customer’s specific needs, allowing feedback and examples to refine our models for optimal results. This customization ensures that sensitivity estimation models are precisely tuned to each customer’s requirements.

‍

What is an example of a customer-specific need?

‍

For instance, one of our customers required a particular combination of PII (personally identifiable information) and NPPI (nonpublic personal information). We tailored our solution by creating a composite classifier to meet their needs by designating documents containing these combinations as having a higher sensitivity level.

Sentra’s sensitivity assessment (that drives classification definition) can be based on detected data classes, document labels, and detection volumes, which triggers extra analysis from our system if needed.

Conclusion

In summary, Sentra’s comprehensive approach to data classification and sensitivity estimation ensures precise and adaptable handling of sensitive data, supporting robust data security at scale. With accurate, granular data classification, security teams can confidently proceed to remediation steps without need for further validation - saving time and streamlining processes. Further, accurate tags allow for automation - by sharing contextual sensitivity data with upstream controls (ex. DLP systems) and remediation workflow tools (ex. ITSM or SOAR).

‍

Additionally, our research and development teams stay abreast of the rapid advancements in Generative AI, particularly focusing on Large Language Models (LLMs). This proactive approach to data classification ensures our models not only meet but often exceed industry standards, delivering state-of-the-art performance while minimizing costs. Given the fast-evolving nature of LLMs, it is highly likely that the models we use today—BERT, GPT, Mistral, and Llama—will soon be replaced by even more advanced, yet-to-be-published technologies.

‍

<blogcta-big>

Hanan Zaichyk

Data Scientist

After earning a BSc in Mathematics and a BSc in Industrial Engineering, followed by an MSc in Computer Science with a thesis in Machine Learning theory, Hanan has spent the last five years training models for feature-based and computer vision problems. Driven by the motivation to deliver real-world value through his expertise, he leverages his strong theoretical background and hands-on experience to explore and implement new methodologies and technologies in machine learning. At Sentra, one of his main focuses is leveraging large language models (LLMs) for advanced classification and analysis tasks.

Latest Blog Posts

Dean Taler

September 16, 2025

Min Read

Compliance

How to Write an Effective Data Security Policy

Introduction: Why Writing Good Policies Matters

In modern cloud and AI-driven environments, having security policies in place is no longer enough. The quality of those policies directly shapes your ability to prevent data exposure, reduce noise, and drive meaningful response. A well-written policy helps to enforce real control and provides clarity in how to act. A poorly written one, on the other hand, fuels alert fatigue, confusion, or worse - blind spots.

‍

This article explores how to write effective, low-noise, action-oriented security policies that align with how data is actually used.

What Is a Data Security Policy?

A data security policy is a set of rules that defines how your organization handles sensitive data. It specifies who can access what information, under what conditions, and what happens when those rules are violated. But here's the key difference: a good data security policy isn't just a document that sits in a compliance folder. It's an active control that detects risky behavior and triggers specific responses. While many organizations write policies that sound impressive but create endless alerts, effective policies target real risks and drive meaningful action. The goal isn't to monitor everything, it's to catch the activities that actually matter and respond quickly when they happen.

What Makes a Data Security Policy “Good”?

Before you begin drafting, ask yourself: what problem is this policy solving, and why does it matter?

A good data security policy isn’t just a technical rule sitting in a console, it’s a sensor for meaningful risk. It should define what activity you want to detect, under what conditions it should trigger, and who or what is in scope, so that it avoids firing on safe, expected scenarios.

‍

Key characteristics of an effective policy:
‍

Clear intent: protects against a well-defined risk, not a vague category of threats.
Actionable outcome: leads to a specific, repeatable response.
Low noise: triggers only on unusual or risky patterns, not normal operations.
Context-aware: accounts for business processes and expected data use.

💡 Tip: If you can’t explain in one sentence what you want to detect and what action should happen when it triggers, your policy isn’t ready for production.

Turning Risk Into Actionable Policy

Data security policies should always be grounded in real business risk, not just what’s technically possible to monitor. A strong policy targets scenarios that could genuinely harm the organization if left unchecked.

‍

Questions to ask before creating a policy:
‍

What specific behavior poses a risk to our sensitive or regulated data?
Who might trigger it, and why? Is it more likely to be malicious, accidental, or operational?
What exceptions or edge cases should be allowed without generating noise?
What systems will enforce it and who owns the response when it fires?

Instead of vague statements like “No access to PII”, write with precision:

‍
“Block and alert on external sharing of customer PII from corporate cloud storage to any domain not on the approved partner list, unless pre-approved via the security exception process.”
‍

Recommendations:
‍

Treat policies like code - start them in monitor-only mode.
Test both sides: validate true positives (catching risky activity) and avoid false positives (triggering on normal behavior).

💡 Tip: The best policies are precise enough to detect real risks, but tested enough to avoid drowning teams in noise.

A Good Data Security Policy Should Drive Action

Policies are only valuable if they lead to a decision or action. Without a clear owner or remediation process, alerts quickly become noise. Every policy should generate an alert that leads to accountability.
‍

Questions to ask:
‍

Who owns the alert?
What should happen when it fires?
How quickly should it be resolved?

💡 Tip: If no one is responsible for acting on a policy’s alerts, it’s not a policy — it’s background noise.

Don’t Ignore the Noise

When too many alerts fire, it’s tempting to dismiss them as an annoyance. But noisy policies are often a signal, not a mistake. Sometimes policies are too broad or poorly scoped. Other times, they point to deeper systemic risks, such as overly open sharing practices or misconfigured controls.
‍

Recommendations:
‍

Investigate noisy policies before silencing them.
Treat excess alerts as a clue to systemic risk.

💡 Tip: A noisy policy may be exposing the exact weakness you most need to fix.

Know When to Adjust or Retire a Policy

Policies must evolve as your organization, tools, and data change. A rule that made sense last year might be irrelevant or counterproductive today.
‍

Recommendations:
‍

Continuously align policies with evolving risks.
Track key metrics: how often it triggers, severity, and response actions.
Optimize response paths so alerts reach the right owners quickly.
Schedule quarterly or biannual reviews with both security and business stakeholders.

💡 Tip: The only thing worse than no policy is a stale one that everyone ignores.

Why Smart Policies Matter for Regulated Data

Data security policies aren’t just an internal safeguard, they are how compliance is enforced in practice. Regulations like GDPR, HIPAA, and PCI DSS require demonstrable control over sensitive data.

Poorly written policies generate alert fatigue, making it harder to detect real violations. Well-crafted ones reduce the risk of noncompliance, streamline audits, and improve breach response.
‍

Recommendations:
‍

Map each policy directly to a specific regulatory requirement.
Retire rules that create noise without reducing actual risk.

💡 Tip: If a policy doesn’t map to a regulation or a real risk, it’s adding effort without adding value.

Making Policy Creation Simple, Powerful, and Built for Results

An effective solution for policy creation should make it easy to get started, provide the flexibility to adapt to your unique environment, and give you the deep data context you need to make policies that actually work. It should streamline the process so you can move quickly without sacrificing control, compliance, or clarity.

‍

Sentra is that solution. By combining intuitive policy building with deep data context, Sentra simplifies and strengthens the entire lifecycle of policy creation.
‍

With Sentra, you can:
‍

Start fast with out-of-the-box, low-noise controls.
Create custom policies without complexity.
Leverage real-time knowledge of where sensitive data lives and who has access to it.
Continuously tune for low noise with performance metrics.
Understand which regulations you can adhere to
‍

💡 Tip: The true value of a policy isn’t how often it triggers, it’s whether it consistently drives the right response.

‍

Good Policies Start with Good Visibility

The best data security policies are written by teams who know exactly where sensitive data lives, how it moves, who can access it, and what creates risk. Without that visibility, policy writing becomes guesswork. With it, enforcement becomes simple, effective, and sustainable.
‍

At Sentra, we believe policy creation should be driven by real data, not assumptions. If you’re ready to move from reactive alerts to meaningful control.

<blogcta-big>

‍

Nikki Ralston

Gilad Golani

September 3, 2025

Min Read

Data Loss Prevention

Supercharging DLP with Automatic Data Discovery & Classification of Sensitive Data

Data Loss Prevention (DLP) is a keystone of enterprise security, yet traditional DLP solutions continue to suffer from high rates of both false positives and false negatives, primarily because they struggle to accurately identify and classify sensitive data in cloud-first environments.

‍

New advanced data discovery and contextual classification technology directly addresses this gap, transforming DLP from an imprecise, reactive tool into a proactive, highly effective solution for preventing data loss.

Why DLP Solutions Can’t Work Alone

DLP solutions are designed to prevent sensitive or confidential data from leaving your organization, support regulatory compliance, and protect intellectual property and reputation. A noble goal indeed. Yet DLP projects are notoriously anxiety-inducing for CISOs. On the one hand, they often generate a high amount of false positives that disrupt legitimate business activities and further exacerbate alert fatigue for security teams.

‍

What’s worse than false positives? False negatives. Today traditional DLP solutions too often fail to prevent data loss because they cannot efficiently discover and classify sensitive data in dynamic, distributed, and ephemeral cloud environments.

‍

Traditional DLP faces a twofold challenge:

High False Positives: DLP tools often flag benign or irrelevant data as sensitive, overwhelming security teams with unnecessary alerts and leading to alert fatigue.
High False Negatives: Sensitive data is frequently missed due to poor or outdated classification, leaving organizations exposed to regulatory, reputational, and operational risks.

These issues stem from DLP’s reliance on basic pattern-matching, static rules, and limited context. As a result, DLP cannot keep pace with the ways organizations use, store, and share data, resulting in the dual-edged sword of both high false positives and false negatives. Furthermore, the explosion of unstructured data types and shadow IT creates blind spots that traditional DLP solutions cannot detect. As a result, DLP often can’t keep pace with the ways organizations use, store, and share data. It isn’t that DLP solutions don’t work, rather they lack the underlying discovery and classification of sensitive data needed to work correctly.

AI-Powered Data Discovery & Classification Layer

Continuous, accurate data classification is the foundation for data security. An AI-powered data discovery and classification platform can act as the intelligence layer that makes DLP work as intended. Here’s how Sentra complements the core limitations of DLP solutions:

1. Continuous, Automated Data Discovery

Comprehensive Coverage: Discovers sensitive data across all data types and locations - structured and unstructured sources, databases, file shares, code repositories, cloud storage, SaaS platforms, and more.
Cloud-Native & Agentless: Scans your entire cloud estate (AWS, Azure, GCP, Snowflake, etc.) without agents or data leaving your environment, ensuring privacy and scalability.
‍
Shadow Data Detection: Uncovers hidden or forgotten (“shadow”) data sets that legacy tools inevitably miss, providing a truly complete data inventory.

2. Contextual, Accurate Classification

AI-Driven Precision: Sentra proprietary LLMs and hybrid models achieve over 95% classification accuracy, drastically reducing both false positives and false negatives.
Contextual Awareness: Sentra goes beyond simple pattern-matching to truly understand business context, data lineage, sensitivity, and usage, ensuring only truly sensitive data is flagged for DLP action.
‍
Custom Classifiers: Enables organizations to tailor classification to their unique business needs, including proprietary identifiers and nuanced data types, for maximum relevance.

3. Real-Time, Actionable Insights

Sensitivity Tagging: Automatically tags and labels files with rich metadata, which can be fed directly into your DLP for more granular, context-aware policy enforcement.
API Integrations: Seamlessly integrates with existing DLP, IR, ITSM, IAM, and compliance tools, enhancing their effectiveness without disrupting existing workflows.
Continuous Monitoring: Provides ongoing visibility and risk assessment, so your DLP is always working with the latest, most accurate data map.

How Sentra Supercharges DLP Solutions

Better Classification Means Less Noise, More Protection

Reduce Alert Fatigue: Security teams focus on real threats, not chasing false alarms, which results in better resource allocation and faster response times.
Accelerate Remediation: Context-rich alerts enable faster, more effective incident response, minimizing the window of exposure.
Regulatory Compliance: Accurate classification supports GDPR, PCI DSS, CCPA, HIPAA, and more, reducing audit risk and ensuring ongoing compliance.
Protect IP and Reputation: Discover and secure proprietary data, customer information, and business-critical assets, safeguarding your organization’s most valuable resources.

Why Sentra Outperforms Legacy Approaches

Sentra’s hybrid classification framework combines rule-based systems for structured data with advanced LLMs and zero-shot learning for unstructured and novel data types.

‍

This versatility ensures:

‍

Scalability: Handles petabytes of data across hybrid and multi-cloud environments, adapting as your data landscape evolves.
Adaptability: Learns and evolves with your business, automatically updating classifications as data and usage patterns change.
Privacy: All scanning occurs within your environment - no data ever leaves your control, ensuring compliance with even the strictest data residency requirements.

Use Case: Where DLP Alone Fails, Sentra Prevails

A financial services company uses a leading DLP solution to monitor and prevent the unauthorized sharing of sensitive client information, such as account numbers and tax IDs, across cloud storage and email. The DLP is configured with pattern-matching rules and regular expressions for identifying sensitive data.

‍

What Goes Wrong:

‍
An employee uploads a spreadsheet to a shared cloud folder. The spreadsheet contains a mix of client names, account numbers, and internal project notes. However, the account numbers are stored in a non-standard format (e.g., with dashes, spaces, or embedded within other text), and the file is labeled with a generic name like “Q2_Projects.xlsx.” The DLP solution, relying on static patterns and file names, fails to recognize the sensitive data and allows the file to be shared externally. The incident goes undetected until a client reports a data breach.

‍

How Sentra Solves the Problem:

‍
To address this, the security team set out to find a solution capable of discovering and classifying unstructured data without creating more overhead. They selected Sentra for its autonomous ability to continuously discover and classify all types of data across their hybrid cloud environment. Once deployed, Sentra immediately recognizes the context and content of files like the spreadsheet that enabled the data leak. It accurately identifies the embedded account numbers—even in non-standard formats—and tags the file as highly sensitive.

‍

This sensitivity tag is automatically fed into the DLP, which then successfully enforces strict sharing controls and alerts the security team before any external sharing can occur. As a result, all sensitive data is correctly classified and protected, the rate of false negatives was dramatically reduced, and the organization avoids further compliance violations and reputational harm.

Getting Started with Sentra is Easy

Deploy Agentlessly: No complex installation. Sentra integrates quickly and securely into your environment, minimizing disruption.
Automate Discovery & Classification: Build a living, accurate inventory of your sensitive data assets, continuously updated as your data landscape changes.
Enhance DLP Policies: Feed precise, context-rich sensitivity tags into your DLP for smarter, more effective enforcement across all channels.
Monitor Continuously: Stay ahead of new risks with ongoing discovery, classification, and risk assessment, ensuring your data is always protected.

“Sentra’s contextual classification engine turns DLP from a reactive compliance checkbox into a proactive, business-enabling security platform.”

‍

Fuel DLP with Automatic Discovery & Classification

DLP is an essential data protection tool, but without accurate, context-aware data discovery and classification, it’s incomplete and often ineffective. Sentra supercharges your DLP with continuous data discovery and accurate classification, ensuring you find and protect what matters most—while eliminating noise, inefficiency, and risk.

‍

Ready to see how Sentra can supercharge your DLP? Contact us for a demo today.

<blogcta-big>

‍

Veronica Marinov

Romi Minin

May 15, 2025

Min Read

AI and ML

Ghosts in the Model: Uncovering Generative AI Risks

Generative AI risks are no longer hypothetical. They’re shaping the way enterprises think about cloud security. As artificial intelligence (AI) becomes deeply integrated into enterprise workflows, organizations are increasingly leveraging cloud-based AI services to enhance efficiency and decision-making.

‍

In 2024, 56% of organizations adopted AI to develop custom applications, with 39% of Azure users leveraging Azure OpenAI services. However, with rapid AI adoption in cloud environments, security risks are escalating. As AI continues to shape business operations, the security and privacy risks associated with cloud-based AI services must not be overlooked. Understanding these risks (and how to mitigate them) is essential for organizations looking to protect their proprietary models and sensitive data.

‍Types of Generative AI Risks in Cloud Environments

When discussing AI services in cloud environments, there are two primary types of services that introduce different types of security and privacy risks. This article dives into these risks and explores best practices to mitigate them, ensuring organizations can leverage AI securely and effectively.

1. Data Exposure and Access Risks in Generative AI Platforms

Examples include OpenAI, Google, Meta, and Microsoft, which develop large-scale AI models and provide AI-related services, such as Azure OpenAI, Amazon Bedrock, Google’s Bard, Microsoft Copilot Studio. These services allow organizations to build AI Agents and GenAI services that are designed to help users perform tasks more efficiently by integrating with existing tools and platforms. For instance, Microsoft Copilot can provide writing suggestions, summarize documents, or offer insights within platforms like Word or Excel, though securing regulated data in Microsoft 365 Copilot requires specific security considerations..

What is RAG (Retrieval-Augmented Generation)?

Many AI systems use Retrieval-Augmented Generation (RAG) to improve accuracy. Instead of solely relying on a model’s pre-trained knowledge, RAG allows the system to fetch relevant data from external sources, such as a vector database, using algorithms like k-nearest neighbor. This retrieved information is then incorporated into the model’s response.

‍

When used in enterprise AI applications, RAG enables AI agents to provide contextually relevant responses. However, it also introduces a risk - if access controls are too broad, users may inadvertently gain access to sensitive corporate data.

How Does RAG (Retrieval-Augmented Generation) Apply to AI Agents?

In AI agents, RAG is typically used to enhance responses by retrieving relevant information from a predefined knowledge base.

‍

Example: In AWS Bedrock, you can define a serverless vector database in OpenSearch as a knowledge base for a custom AI agent. This setup allows the agent to retrieve and incorporate relevant context dynamically, effectively implementing RAG.

Generative AI Risks and Security Threats of AI Platforms

Custom generative AI applications, such as AI agents or enterprise-built copilots, are often integrated with organizational knowledge bases like Amazon S3, SharePoint, Google Drive, and other data sources. While these models are typically not directly trained on sensitive corporate data, the fact that they can access these sources creates significant security risks.

One potential generative AI risk is data exposure through prompts, but this only arises under certain conditions. If access controls aren’t properly configured, users interacting with AI agents might unintentionally or maliciously - prompt the model to retrieve confidential or private information.This isn’t limited to cleverly crafted prompts; it reflects a broader issue of improper access control and governance.

Configuration and Access Control Risks

The configuration of the AI agent is a critical factor. If an agent is granted overly broad access to enterprise data without proper role-based restrictions, it can return sensitive information to users who lack the necessary permissions. For instance, a model connected to an S3 bucket with sensitive customer data could expose that data if permissions aren’t tightly controlled. Simple misconfigurations can lead to serious data exposure incidents, even in applications designed for security.

‍

A common scenario might involve an AI agent designed for Sales that has access to personally identifiable information (PII) or customer records. If the agent is not properly restricted, it could be queried by employees outside of Sales, such as developers - who should not have access to that data.

Example Generative AI Risk Scenario

An employee asks a Copilot-like agent to summarize company-wide sales data. The AI returns not just high-level figures, but also sensitive customer or financial details that were unintentionally exposed due to lax access controls.

Challenges in Mitigating Generative AI Risks

The core challenge, particularly relevant to platforms like Sentra, is enforcing governance to ensure only appropriate data is used and accessible by AI services.

‍

This includes:

Defining and enforcing granular data access controls.
Preventing misconfigurations or overly permissive settings.
Maintaining real-time visibility into which data sources are connected to AI models.
Continuously auditing data flows and access patterns to prevent leaks.

Without rigorous governance and monitoring, even well-intentioned GenAI implementations can lead to serious data security incidents.

2. ML and AI Studios for Building New Models

Many companies, such as large financial institutions, build their own AI and ML models to make better business decisions, or to improve their user experiences. Unlike large foundational models from major tech companies, these custom AI models are trained by the organization itself on their applications or corporate data.
‍

Security Risks of Custom AI Models

Weak Data Governance Policies - If data governance policies are inadequate, sensitive information, such as customers' Personally Identifiable Information (PII), could be improperly accessed or shared during the training process. This can lead to data breaches, privacy compliance violations, and unethical AI usage. The growing recognition of generative AI-related risks has driven the development of more AI compliance frameworks that are now being actively enforced with significant penalties..
‍
Excessive Access to Training Data and AI Models - Granting unrestricted access to training datasets and machine learning (ML)/AI models increases the risk of data leaks and misuse. Without proper access controls, sensitive data used in training can be exposed to unauthorized individuals, leading to compliance and security concerns.
‍
AI Agents Exposing Sensitive Data - AI agents that do not have proper safeguards can inadvertently expose sensitive information to a broad audience within an organization. For example, an employee could retrieve confidential data such as the CEO’s salary or employment contracts if access controls are not properly enforced.
‍
Insecure Model Storage – Once a model is trained, it is typically stored in the same environment (e.g., in Amazon SageMaker, the training job stores the trained model in S3). If not properly secured, proprietary models could be exposed to unauthorized access, leading to risks such as model theft.
‍
Deployment Vulnerabilities – A lack of proper access controls can result in unauthorized use of AI models. Organizations need to assess who has access: Is the model public? Can external entities interact with or exploit it?

Shadow AI and Forgotten Assets – AI models or artifacts that are not actively monitored or properly decommissioned can become a security risk. These overlooked assets can serve as attack vectors if discovered by malicious actors.

Example Risk Scenario

A bank develops an AI-powered feature that predicts a customer’s likelihood of repaying a loan based on inputs like financial history, employment status, and other behavioral indicators. While this feature is designed to enhance decision-making and customer experience, it introduces significant generative AI risk if not properly governed.
‍

During development and training, the model may be exposed to personally identifiable information (PII), such as names, addresses, social security numbers, or account details, which is not necessary for the model’s predictive purpose.
‍

⚠️ Best practice: Models should be trained only on the minimum necessary data required for performance, excluding direct identifiers unless absolutely essential. This reduces both privacy risk and regulatory exposure.

If the training pipeline fails to properly separate or mask this PII, the model could unintentionally leak sensitive information. For example, when responding to an end-user query, the AI might reference or infer details from another individual’s record - disclosing sensitive customer data without authorization.

‍

This kind of data leakage, caused by poor data handling or weak governance during training, can lead to serious regulatory non-compliance, including violations of GDPR, CCPA, or other privacy frameworks.

Common Risk Mitigation Strategies and Their Limitations

Many organizations attempt to manage generative AI-related risks through employee training and awareness programs. Employees are taught best practices for handling sensitive data and using AI tools responsibly.
While valuable, this approach has clear limitations:
‍

Training Alone Is Insufficient:
Human error remains a major risk factor, even with proper training. Employees may unintentionally connect sensitive data sources to AI models or misuse AI-generated outputs.
Lack of Automated Oversight:
Most organizations lack robust, automated systems to continuously monitor how AI models use data and to enforce real-time security policies. Manual review processes are often too slow and incomplete to catch complex data access risks in dynamic, cloud-based AI environments.
‍
‍Policy Gaps and Visibility Challenges:
Organizations often operate with multiple overlapping data layers and services. Without clear, enforceable policies, especially automated ones - certain data assets may remain unscanned or unprotected, creating blind spots and increasing risk.

Reducing AI Risks with Sentra’s Comprehensive Data Security Platform

Managing generative AI risks in the cloud requires more than employee training.
Organizations need to adopt robust data governance frameworks and data security platforms (like Sentra’s) that address the unique challenges of AI.

This includes:

Discovering AI Assets: Automatically identify AI agents, knowledge bases, datasets, and models across the environment.
Classifying Sensitive Data: Use automated classification and tagging to detect and label sensitive information accurately.
Monitoring AI Data Access: Detect which AI agents and models are accessing sensitive data, or using it for training - in real time.
Enforcing Access Governance: Govern AI integrations with knowledge bases by role, data sensitivity, location, and usage to ensure only authorized users can access training data, models, and artifacts.
Automating Data Protection: Apply masking, encryption, access controls, and other protection methods through automated remediation capabilities across data and AI artifacts used in training and inference processes.

By combining strong technical controls with ongoing employee training, organizations can significantly reduce the risks associated with AI services and ensure compliance with evolving data privacy regulations.

<blogcta-big>

‍

Expert Data Security Insights Straight to Your Inbox

What Should I Do Now:

Get the latest GigaOm DSPM Radar report - see why Sentra was named a Leader and Fast Mover in data security.Download now and stay ahead on securing sensitive data.

Sign up for a demo and learn how Sentra’s data security platform can uncover hidden risks, simplify compliance, and safeguard your sensitive data.

Follow us on LinkedIn, X (Twitter), and YouTube for actionable expert insights on how to strengthen your data security, build a successful DSPM program, and more!

How Sentra Accurately Classifies Sensitive Data at Scale

Background on Classifying Different Types of Data

What Is Structured Data?

What Is Unstructured Data?

Data Classification Methods & Models

Rule-Based Systems

Large Language Models (LLMs)

Applications of LLMs at Sentra

Sentra’s Generative LLM Inference Approaches

Supervised Trained Models (e.g., BERT)

Zero-Shot Classification

Sentra’s Data Sensitivity Estimation Methodologies

Accumulated Knowledge

How does Sentra accumulate the knowledge? (is it via AI/ML?)

Customer-Specific Needs

What is an example of a customer-specific need?

Conclusion

Latest Blog Posts

How to Write an Effective Data Security Policy

How to Write an Effective Data Security Policy

Introduction: Why Writing Good Policies Matters

What Is a Data Security Policy?

What Makes a Data Security Policy “Good”?

Key characteristics of an effective policy:‍

Turning Risk Into Actionable Policy

Questions to ask before creating a policy:‍

Recommendations:‍

A Good Data Security Policy Should Drive Action

Questions to ask:‍

Don’t Ignore the Noise

Recommendations:‍

Know When to Adjust or Retire a Policy

Recommendations:‍

Why Smart Policies Matter for Regulated Data

Recommendations:‍

Making Policy Creation Simple, Powerful, and Built for Results

With Sentra, you can:‍

Good Policies Start with Good Visibility

Supercharging DLP with Automatic Data Discovery & Classification of Sensitive Data

Supercharging DLP with Automatic Data Discovery & Classification of Sensitive Data

Why DLP Solutions Can’t Work Alone

AI-Powered Data Discovery & Classification Layer

1. Continuous, Automated Data Discovery

2. Contextual, Accurate Classification

3. Real-Time, Actionable Insights

How Sentra Supercharges DLP Solutions

Better Classification Means Less Noise, More Protection

Why Sentra Outperforms Legacy Approaches

Use Case: Where DLP Alone Fails, Sentra Prevails

What Goes Wrong:

How Sentra Solves the Problem:

Getting Started with Sentra is Easy

Fuel DLP with Automatic Discovery & Classification

Ghosts in the Model: Uncovering Generative AI Risks

Ghosts in the Model: Uncovering Generative AI Risks

‍Types of Generative AI Risks in Cloud Environments

1. Data Exposure and Access Risks in Generative AI Platforms

What is RAG (Retrieval-Augmented Generation)?

How Does RAG (Retrieval-Augmented Generation) Apply to AI Agents?

Generative AI Risks and Security Threats of AI Platforms

Configuration and Access Control Risks

Example Generative AI Risk Scenario

Challenges in Mitigating Generative AI Risks

2. ML and AI Studios for Building New Models

Security Risks of Custom AI Models

Example Risk Scenario

Common Risk Mitigation Strategies and Their Limitations

Reducing AI Risks with Sentra’s Comprehensive Data Security Platform

Key characteristics of an effective policy:
‍

Questions to ask before creating a policy:
‍

Recommendations:
‍

Questions to ask:
‍

Recommendations:
‍

Recommendations:
‍

Recommendations:
‍

With Sentra, you can:
‍