Advancing Sensitive Data Classification in the Age of AI

Carmine Clementelli

Aug 23, 2024

Advancing Sensitive Data Classification in the Age of AI

Traditional Methods and Their Limitations

Sensitive data detection and classification have long been the cornerstones of effective data security solutions. This process identifies and categorizes sensitive information across an organization’s digital landscape in an automatic fashion, enabling businesses to protect what matters most. However, traditional methods—relying on static detection algorithms like regex-based data identifiers—often fall short, leading to inaccuracies, context-less results, and high volumes of false positives. These false positives disrupt business operations and overwhelm incident response teams, forcing them to manually differentiate between legitimate policy violations and benign activities.

Other, more accurate methods, like Exact Data Matching (EDM), are too resource-intensive, requiring significant time and computing power to fingerprint databases and large files. As a result, they are often avoided, such as for endpoint data discovery.

Legacy data protection solutions, like traditional data loss prevention (DLP) and first-generation data security posture management (DSPM), lack the adaptability required to accurately assess the sensitivity of data in context. Human analysts can naturally interpret data with high precision by considering the full context—something static, rule-based systems struggle to achieve. As a result, these traditional methods require continuous manual tuning and are often too rigid to keep pace with the dynamic nature of modern data and collaboration practices.

A New Era: Leveraging AI and LLMs for Data Classification

Enter AI and Large Language Models (LLMs). These advanced technologies enable a quantum leap in sensitive data detection and classification. While initial concerns around data privacy and the use of AI models were valid, innovations in secure, private AI implementations have alleviated these fears.

Cyera Enhances Data Classification

Cyera leverages traditional data detection methods for quick and easy recognition of sensitive data, using common data identifiers, natural expressions and rich contextual information around data and files. But it doesn’t stop there. Cyera augments traditional detection methods with advanced data-centric AI and LLMs to offer a robust, accurate, and context-aware data classification solution. Cyera handles structured, unstructured, and semi-structured data types.

Here’s how Cyera’s approach works:

Data Scanning and Sampling
Cyera scans data stored in a wide range of cloud and on-premises environments. For structured data, Cyera clones a snapshot of the database locally. For unstructured data, Cyera clusters similar files via Machine Learning (ML) and uses small samples of the cluster to obtain a meaningful and diversified data set, which accurately reflects the customer environment while maximizing classification speed and accuracy. During this process, Cyera identifies sensitive data, analyzes metadata, and gathers context, such as the data’s owner, location, and level of sensitivity. This enhances scanning speed, overcoming the limitations of traditional data discovery methods.
AI-Powered Classification
Leveraging proprietary and contained AI models, Cyera classifies data with a remarkable 95% precision. The system also auto-learns from each customer’s unique environment, identifying never seen before patterns and data types that traditional methods would miss, even across different geography contexts and languages.
Contextual Enrichment
Beyond mere classification, Cyera enriches data by identifying contextual factors like data subject roles, geo-locations, and the specific sensitivity levels of different data types. This nuanced understanding allows Cyera to apply the appropriate security measures without over-protecting non-sensitive data.
Privacy and Security
Cyera’s AI models are developed in-house and trained securely, ensuring that customer data remains private and isolated. The models are optimized for each environment, providing high precision without risking data leakage or spillage.

Image: 3 types of AI/ML data classification

‍

How It All Comes Together: Cyera’s AI and LLM Data Classification Models in Action

Cyera’s AI-driven data classification is engineered for exceptional accuracy in identifying and classifying sensitive data. Developed in-house, Cyera’s AI and Large Language Models (LLMs) leverage open-source foundation models like FLAN T5 and Mistral, which are significantly enhanced through Cyera’s proprietary training processes. The models are trained and fine-tuned using extensive datasets and optimized with hyperparameters, all within Cyera’s secure environment, ensuring they remain isolated from external exposures.

The true strength of Cyera’s models lies in their ability to auto-learn and adapt to customer-specific data. They can learn to recognize unique data formats, such as customer-specific employee IDs, product SKUs, and claim numbers, continuously refining their classification capabilities to accurately identify and classify even the most nuanced data types.

As mentioned earlier in this blog, Cyera’s system also incorporates data enrichment, adding contextual layers to classifications by evaluating factors like data subject roles, geographic locations, and data-level protections, ensuring that data sensitivity is assessed within the proper context.

Privacy and security are paramount in Cyera’s processes. The AI models primarily utilize public datasets for training and are enriched by selectively incorporating minimal, protected data samples from the customer environment for further training. While the AI models can be trained using minimal amounts of customer data, this is done securely, ensuring the data is embedded, irreversible, and segregated to prevent any exposure, fundamentally maintaining strict data privacy standards. Customers can also opt-out of data usage without compromising the quality of service.

Our AI models for data classification are proprietary to Cyera. We do not communicate with any public generative AI systems. Instead, we leverage advancements in this field through our own researchers, who track generative AI capabilities and ensure that the value we provide with our models remains competitive and innovative.

Additional capabilities that set Cyera apart

Comprehensive Support for Modern Data Types

Cyera’s solution supports a wide array of file types—structured, semi-structured, and unstructured—across any environment, whether it’s SaaS, IaaS, PaaS, or on-premises. This broad coverage ensures that no data is left unclassified, regardless of format or location.

Identity Access Insights

In addition to classification, Cyera provides insights into who or what has access to sensitive data. It automatically assigns trust levels to both human and non-human identities, helping organizations enforce Zero Trust policies and prevent unauthorized access.

Conclusion: The Future of Data Security

As data sprawl continues to grow, the need for advanced, accurate, and context-aware data classification becomes more critical than ever. By integrating AI and LLMs, Cyera offers a solution that not only enhances data protection, privacy and compliance but also supports business agility by dramatically reducing false positives and ensuring a lean and stress-free incident response process. In the age of AI, Cyera is leading the charge in redefining how sensitive data is detected, classified, and protected.

‍Request a demo of Cyera

Topic