Hanan Zaichyk
After earning a BSc in Mathematics and a BSc in Industrial Engineering, followed by an MSc in Computer Science with a thesis in Machine Learning theory, Hanan has spent the last five years training models for feature-based and computer vision problems. Driven by the motivation to deliver real-world value through his expertise, he leverages his strong theoretical background and hands-on experience to explore and implement new methodologies and technologies in machine learning. At Sentra, one of his main focuses is leveraging large language models (LLMs) for advanced classification and analysis tasks.
Hanan's Data Security Posts
Building a Better DSPM by Combining Data Classification Techniques
The increasing prevalence of data breaches is driving many organizations to add another tool to their ever-growing security arsenal - data security posture management, or DSPM.
This new approach recognizes that not all data is equal - breaches to some data can have dire implications for an organization, while breaches to other data can be very alarming but will not cause major financial or reputational damage.
At Sentra, we’re building the product to fulfill this approach's potential by mapping all data in organizations’ cloud environments, determining where sensitive data is stored, and who has access to it. Because some data is more sensitive than others, accurate classification of data is the core of a successful DSPM solution.
Unfortunately, there’s no single approach that can cover all data classes optimally, so we need to employ a number of classification techniques, scanning methods, verification processes, and advanced statistical analysis. By combining techniques whose strengths offset one another’s weaknesses, we can reach a high level of accuracy for all types of sensitive cloud data.
Let’s dive into some of these techniques to see how different methods can be used in different situations to achieve superior results.
The Power and Limits of Regular Expressions
Regular expressions are a very robust tool that can precisely capture a wide range of data entities at scale. Regular expressions capture a pattern - specific character types, their order, and lengths. Using regular expressions for classification involves looking at a string of characters - without any context - and deducing what entity it represents based only on the pattern of the string. A couple of examples where this can be used effectively are AWS keys and IP addresses. We know how many characters and what type of characters these entities contain.
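As a sketch of this pattern-only approach, here are illustrative regular expressions for AWS access key IDs and IPv4 addresses. The exact prefixes and lengths are assumptions drawn from public AWS documentation, not Sentra's production rules:

```python
import re

# AWS access key IDs are commonly 20 characters: a 4-letter prefix such as
# "AKIA" (long-term) or "ASIA" (temporary), then 16 uppercase letters/digits.
AWS_ACCESS_KEY_RE = re.compile(r"(AKIA|ASIA)[0-9A-Z]{16}")

# IPv4: four octets, each in the range 0-255.
IPV4_RE = re.compile(
    r"(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}"
    r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
)

def classify_token(token: str) -> str:
    """Classify a string purely by its character pattern, with no context."""
    if AWS_ACCESS_KEY_RE.fullmatch(token):
        return "aws_access_key_id"
    if IPV4_RE.fullmatch(token):
        return "ipv4_address"
    return "unknown"
```

Both patterns succeed without any surrounding context, which is exactly what makes them fast and precise for well-defined entities.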
However, the limitation of this approach is that if the pattern of the characters isn’t sufficient to classify the entity, a regular expression will need ‘help’. For example, a 9 digit number can represent a number of things, but if it is on a driver’s license it’s probably a license number, if it’s on a tax return, it’s probably a Social Security Number, etc.
Humans do this subconsciously all the time. If you hear someone’s name is ‘George’ you know that’s a common first name, and you will assume - usually correctly - that the individual’s first name is ‘George’ and not his last name.
So what we need is a classification engine that can make these connections the way humans do - one that can look at the context of the string, and not just its content. One way to provide that context is to give the engine a list of names and tell it “these are first names”, so that it can accurately make these connections. Discovery and classification of sensitive data is one of many DSPM use cases Sentra employs to secure your data.
Another method of providing context is Named Entity Recognition (NER), a technique from Natural Language Processing (NLP) that analyzes sentences to determine the category of each word. Compensating for the limitations of regular expressions with techniques like these is one way Sentra ensures that we’re always using the best possible classification technique.
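A minimal sketch of context-aware classification - using nearby keywords as a simple stand-in for full NER - might look like the following. The keyword map and window size are hypothetical, chosen only to illustrate the idea:

```python
import re

NINE_DIGITS = re.compile(r"\b\d{9}\b")

# Hypothetical context keywords; a production system would use NER and
# richer NLP signals rather than this simple lookup.
CONTEXT_HINTS = {
    "ssn": "social_security_number",
    "social security": "social_security_number",
    "license": "drivers_license_number",
    "routing": "bank_routing_number",
}

def classify_nine_digit(text: str, window: int = 40) -> list:
    """Label each 9-digit number using keywords found near it in the text."""
    results = []
    lowered = text.lower()
    for m in NINE_DIGITS.finditer(text):
        start = max(0, m.start() - window)
        context = lowered[start:m.end() + window]
        label = "unknown_9_digit_number"
        for hint, entity in CONTEXT_HINTS.items():
            if hint in context:
                label = entity
                break
        results.append((m.group(), label))
    return results
```

The same nine digits receive different labels depending on the words around them - the connection a bare regular expression cannot make.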
Of course, we still need to ensure that these patterns or data entities are actually the ones we’re looking for. For example, let’s say we identify a 16 digit number. This could be a credit card number. But it could also be a user ID, bank account number, a tracking number, or just a very large number.
So how do we determine if this is, in fact, a credit card number?
There are a number of ways we can confirm this.
(These approaches use credit card numbers as the example, but they extend to many other data classes):
- Verify the integrity of the results: Credit card numbers include a check digit - the last digit of any card number - designed to catch typos. We can verify that it is correct, and that the first few digits fall within the ranges allowed for credit cards.
- Model the internal structure of the data: If data is in tabular form, such as a .csv file, we can model relationships between column values, labeling a whole column as credit card numbers only if, for example, at least 50% of its values are valid ones.
- Look at the data’s ‘detection context’: If data is in tabular form, such as a .csv file, we can increase our certainty of a credit card detection if the column is named “credit card number”. Relationships between columns can also supply missing context: a column suspected to hold credit card numbers becomes much more probable if the same table contains an expiration date column and a CVV column. When the data is in free-form text (as in a .docx file), this is much more complicated, and tools such as natural language understanding and keyword analysis must be applied to classify the data accurately.
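The first two checks can be sketched concretely. The Luhn check-digit algorithm below is standard for card numbers; the 50%/90% thresholds and header keywords are illustrative assumptions, not Sentra's production values:

```python
def luhn_valid(number: str) -> bool:
    """Check a card number's Luhn check digit (the last digit)."""
    if not number.isdigit() or len(number) < 13:
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def column_is_credit_cards(header: str, values: list, threshold: float = 0.5) -> bool:
    """Label a column as credit card numbers when enough values validate.

    The column header supplies detection context: a matching name lowers
    the fraction of valid values required.
    """
    valid_ratio = sum(luhn_valid(v) for v in values) / max(len(values), 1)
    header_hint = any(k in header.lower() for k in ("credit", "card", "cc"))
    return valid_ratio >= (threshold if header_hint else 0.9)
```

Combining the per-value check with column-level statistics and the header's context is what turns a plausible guess into a confident classification.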
These are a few examples of methods that, combined appropriately, yield results that are not only much more accurate but also much more useful for explaining and understanding the reasoning behind classification decisions.
Data classification has long been a challenge because of the limitations in different models. Only by using different methods in conjunction are we able to classify with the level of accuracy required to assist data and security teams responsible for securing large quantities of cloud data.
To learn more about our approach to cloud data security, watch the demo of the Sentra platform here.
How Sentra Accurately Classifies Sensitive Data at Scale
Background on Classifying Different Types of Data
It’s first helpful to review the primary types of data we need to classify - structured and unstructured data - and some of the historical challenges associated with analyzing and accurately classifying them.
What Is Structured Data?
Structured data has a standardized format that makes it easily accessible for both software and humans. Typically organized in tables with rows and/or columns, structured data allows for efficient data processing and insights. For instance, a customer data table with columns for name, address, customer-ID and phone number can quickly reveal the total number of customers and their most common localities.
Moreover, it is easier to conclude that a value in the phone number column is a phone number, while a value in the ID column is a customer ID. This contrasts with unstructured data, where the role of each word is not straightforward.
What Is Unstructured Data?
Unstructured data, on the other hand, refers to information that is not organized according to a preset model or schema, making it unsuitable for traditional relational databases (RDBMS). This type of data constitutes over 80% of all enterprise data, and 95% of businesses prioritize its management. The volume of unstructured data is growing rapidly, outpacing the growth rate of structured databases.
Examples of unstructured data include:
- Various business documents
- Text and multimedia files
- Email messages
- Videos and photos
- Webpages
- Audio files
While unstructured data stores contain valuable information that is often essential to the business and can guide decisions, unstructured data classification has historically been challenging. However, AI and machine learning have led to better methods for understanding data content and uncovering the sensitive data embedded within it.
The division into structured and unstructured data is not always clear-cut. For example, an unstructured object like a .docx document can contain a table, while a structured table can contain cells of free text that is itself unstructured. Moreover, there are cases of semi-structured data. All of these considerations are part of the Sentra classification system but beyond the scope of this blog.
Data Classification Methods & Models
Applying the right data classification method is crucial for achieving optimal performance and meeting specific business needs. Sentra employs a versatile decision framework that automatically leverages different classification models depending on the nature of the data and the requirements of the task.
We utilize two primary approaches:
- Rule-Based Systems
- Large Language Models (LLMs)
Rule-Based Systems
Rule-based systems are employed when the data contains entities that follow specific, predictable patterns, such as email addresses or checksum-validated numbers. This method is advantageous due to its fast computation, deterministic outcomes, and simplicity, often providing the most accurate results for well-defined scenarios.
Due to their simplicity, efficiency, and deterministic nature, Sentra uses rule-based models whenever possible for data classification. These models are particularly effective in structured data environments, which possess invaluable characteristics such as inherent structure and repetitiveness.
For instance, a table named "Transactions" with a column labeled "Credit Card Number" allows for straightforward logic to achieve high accuracy in determining that the document contains credit card numbers. Similarly, the uniformity in column values can help classify a column named "Abbreviations" as 'Country Name Abbreviations' if all values correspond to country codes.
Sentra also uses rule-based labeling for document and entity detection in simple cases, where document properties provide enough information. Customer-specific rules and simple patterns with strong correlations to certain labels are also handled efficiently by rule-based models.
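The "Abbreviations" example above can be sketched as a simple rule. The country-code set here is a small illustrative sample, not a full ISO 3166 table, and the rule itself is an assumption about how such logic might look:

```python
from typing import Optional

# Illustrative sample of country codes; a real rule would use the full
# ISO 3166 alpha-2 table.
COUNTRY_CODES = {"US", "GB", "DE", "FR", "IL", "JP", "CA", "AU", "IN", "BR"}

def classify_abbreviation_column(values: list) -> Optional[str]:
    """Label an 'Abbreviations' column when every value is a known country code."""
    cleaned = [v.strip().upper() for v in values if v.strip()]
    if cleaned and all(v in COUNTRY_CODES for v in cleaned):
        return "Country Name Abbreviations"
    return None
```

Rules like this are deterministic and essentially free to run, which is why they are preferred wherever the data's structure and repetitiveness make them reliable.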
Large Language Models (LLMs)
Large Language Models (LLMs) such as BERT, GPT, and Llama represent significant advancements in natural language processing, each with distinct strengths and applications. BERT (Bidirectional Encoder Representations from Transformers) is designed for fine-grained understanding of text by processing it bidirectionally, making it highly effective for tasks like Named Entity Recognition (NER) when trained on large, labeled datasets. In contrast, autoregressive models like GPT (Generative Pre-trained Transformer) and Llama (Large Language Model Meta AI) excel at generating and understanding text with minimal additional training. These models leverage extensive pre-training on diverse data to perform new tasks in a few-shot or zero-shot manner. Their rich contextual understanding, ability to follow instructions, and generalization capabilities allow them to handle tasks with less dependency on large labeled datasets, making them versatile and powerful tools in the field of NLP. However, their great value comes at a cost in computational power, so they should be used with care and only when necessary.
Applications of LLMs at Sentra
Sentra uses LLMs for both Named Entity Recognition (NER) and document labeling tasks. The input to the models is similar, with minor adjustments, and the output varies depending on the task:
- Named Entity Recognition (NER): The model labels each word or sentence in the text with its correct entity (which Sentra refers to as a data class).
- Document Labels: The model labels the entire text with the appropriate label (which Sentra refers to as a data context).
- Continuous Automatic Analysis: Sentra uses its LLMs to continuously analyze customer data, help our analysts find potential mistakes, and to suggest new entities and document labels to be added to our classification system.
Sentra’s Generative LLM Inference Approaches
An inference approach in the context of machine learning involves using a trained model to make predictions or decisions based on new data. This is crucial for practical applications where we need to classify or analyze data that wasn't part of the original training set.
When working with complex or unstructured data, effective methods for interpreting and classifying the information are essential, and this is where Sentra employs generative LLMs. Sentra’s main approaches to LLM inference are as follows:
Supervised Trained Models (e.g., BERT)
In-house trained models are used when high precision is needed for domain-specific entities and sufficient relevant data is available for training. They can be customized to capture the subtle nuances of specific datasets, enhancing accuracy for specialized entity types. These are transformer-based deep neural networks with a “classic” fixed-size input and a well-defined output size, in contrast to generative models. Sentra uses the BERT architecture, modified and trained on our in-house labeled data, to create a model well-suited for classifying specific data types.
This approach is advantageous because:
- In multi-category classification, where a model needs to classify an object into one of many possible categories, the model outputs a vector the size of the number of categories, n. For example, when classifying a text document into categories like ["Financial," "Sports," "Politics," "Science," "None of the above"], the output vector will be of size n=5. Each coordinate of the output vector represents one of the categories, and the model's output can be interpreted as the likelihood of the input falling into one of these categories.
- The BERT model is well-designed for fine-tuning specific classification tasks. Changing or adding computation layers is straightforward and effective.
- The model is relatively small - around 110 million parameters requiring less than 500MB of memory - making it feasible both to fine-tune the model’s weights for a wide range of tasks and, more importantly, to run it in production at low computational cost.
- It has proven state-of-the-art performance on various NLP tasks like GLUE (General Language Understanding Evaluation), and Sentra’s experience with this model shows excellent results.
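The multi-category output described above is typically turned into category likelihoods with a softmax. The logits below are hypothetical values such a classification head might emit; only the mechanics are meant to be instructive:

```python
import math

CATEGORIES = ["Financial", "Sports", "Politics", "Science", "None of the above"]

def softmax(logits):
    """Convert raw model outputs into a probability distribution."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from a fine-tuned n=5 classification head.
logits = [4.1, 0.3, 0.2, 1.5, -0.7]
probs = softmax(logits)
predicted = CATEGORIES[probs.index(max(probs))]
```

Each coordinate of `probs` corresponds to one category, and the coordinates sum to one, which is what lets the output be read as the likelihood of each category.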
Zero-Shot Classification
One of the key techniques Sentra has recently started to utilize is zero-shot classification, which can interpret and classify data without any task-specific training. This approach allows Sentra to efficiently and precisely understand the contents of various documents, ensuring high accuracy in identifying sensitive information. These models’ comprehensive understanding of English (and almost any language) enables us to classify objects customized to a customer’s needs without creating a labeled data set. This not only saves time by eliminating repetitive training but also proves crucial in situations where defining specific cases for detection is challenging. When handling sensitive or rare data, this zero-shot and few-shot capability is a significant advantage.
Our use of zero-shot classification within LLMs significantly enhances our data analysis capabilities. By leveraging this method, we achieve high accuracy - with a false positive rate as low as three to five percent - without the need for extensive task-specific training.
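The essence of zero-shot classification is that the candidate labels travel with the request rather than being baked into the model. The prompt template below is a minimal sketch; its wording and labels are illustrative assumptions, not Sentra's actual prompts:

```python
def build_zero_shot_prompt(document: str, labels: list) -> str:
    """Build a zero-shot classification prompt listing candidate labels inline."""
    label_list = ", ".join(labels)
    return (
        f"Classify the document into exactly one of these categories: {label_list}.\n"
        "Answer with the category name only.\n\n"
        f"Document:\n{document}\n\nCategory:"
    )

# Hypothetical document and label set for illustration.
prompt = build_zero_shot_prompt(
    "Patient: John Doe, DOB 01/02/1980, diagnosis attached.",
    ["Medical Record", "Invoice", "Resume", "Other"],
)
```

Because the label set is just part of the input, a customer-specific category can be added by editing one list - no labeled data set or retraining required.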
Sentra’s Data Sensitivity Estimation Methodologies
Accurate classification is only one (crucial) step in determining whether a document is sensitive. At the end of the day, a customer must also be able to discern whether a document contains the addresses, phone numbers, or emails of the company’s own offices or of the company’s clients.
Accumulated Knowledge
Sentra has developed domain expertise to predict which objects are generally considered more sensitive. For example, documents with login information are more sensitive compared to documents containing random names.
How does Sentra accumulate this knowledge?
Sentra combines insights from our experience with current customers and their needs with machine learning models that continuously improve based on the data they are trained on over time.
Customer-Specific Needs
Sentra tailors sensitivity models to each customer’s specific needs, allowing feedback and examples to refine our models for optimal results. This customization ensures that sensitivity estimation models are precisely tuned to each customer’s requirements.
What is an example of a customer-specific need?
For instance, one of our customers required a particular combination of PII (personally identifiable information) and NPPI (nonpublic personal information). We tailored our solution by creating a composite classifier that designates documents containing these combinations as having a higher sensitivity level.
Sentra’s sensitivity assessment (which drives classification) can be based on detected data classes, document labels, and detection volumes, and can trigger extra analysis from our system when needed.
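A composite sensitivity rule like the PII + NPPI example might be sketched as follows. The class names and sensitivity levels here are assumptions for illustration, not Sentra's actual taxonomy:

```python
# Hypothetical data-class groupings for a composite sensitivity rule.
PII_CLASSES = {"name", "email", "phone_number"}
NPPI_CLASSES = {"account_number", "credit_score", "ssn"}

def sensitivity_level(detected_classes: set) -> str:
    """Escalate sensitivity when PII and NPPI classes co-occur in a document."""
    has_pii = bool(detected_classes & PII_CLASSES)
    has_nppi = bool(detected_classes & NPPI_CLASSES)
    if has_pii and has_nppi:
        return "high"       # the combination is more sensitive than either alone
    if has_pii or has_nppi:
        return "medium"
    return "low"
```

The key design point is that sensitivity is a function of co-occurring detections, not of any single data class in isolation.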
Conclusion
In summary, Sentra’s comprehensive approach to data classification and sensitivity estimation ensures precise and adaptable handling of sensitive data, supporting robust data security at scale. With accurate, granular data classification, security teams can confidently proceed to remediation steps without the need for further validation - saving time and streamlining processes. Furthermore, accurate tags enable automation by sharing contextual sensitivity data with upstream controls (e.g., DLP systems) and remediation workflow tools (e.g., ITSM or SOAR).
Additionally, our research and development teams stay abreast of the rapid advancements in Generative AI, particularly focusing on Large Language Models (LLMs). This proactive approach to data classification ensures our models not only meet but often exceed industry standards, delivering state-of-the-art performance while minimizing costs. Given the fast-evolving nature of LLMs, it is highly likely that the models we use today - BERT, GPT, Mistral, and Llama - will soon be replaced by even more advanced, yet-to-be-published technologies.