Sensitive Data Discovery and Classification

Sensitive data discovery and classification is a process used to identify and categorize sensitive or confidential information within an organization's digital assets. This information can include personally identifiable information (PII), payment card information (PCI), financial data, healthcare records, intellectual property, trade secrets, and other types of sensitive information that need to be protected from unauthorized access or disclosure.

Forrester defines data discovery and classification as, "The ability to provide visibility into where sensitive data is located; identify what the sensitive data is and why it’s considered sensitive; and tag or label data based on its level of sensitivity. Sensitive data discovery and classification is valuable in that it identifies what you must protect and facilitates the next step of enabling data security controls. Organizations use this visibility into and understanding of data to optimize data use and handling policies and identify appropriate security, privacy, and data governance controls. They may automate remediation capabilities to protect the data and surface insights that inform policy, data handling, and data lifecycle decisions."

According to Gartner, "Data discovery solutions discover, analyze and classify structured and unstructured data to create actionable outcomes for security enforcement and data life cycle management. Using elements of metadata, content and contextual information, combined with expression- and machine-learning-based data models, data discovery solutions provide actionable guidance and processes to advance data management and security initiatives."

The process of discovering and classifying data is crucial for maintaining data security, privacy, and compliance. By identifying and categorizing sensitive information, organizations can take appropriate measures to protect it, reduce the risk of data breaches, and maintain trust with customers, partners, and regulatory bodies. Automated tools and technologies are often employed to streamline and enhance the efficiency of this process, given the vast amounts of data that organizations generate and store.

In this piece, you'll get an overview of sensitive data discovery and classification, including what it is, how it came about, and how it's typically carried out. We'll identify some of the primary challenges security teams face with legacy approaches to discovery and classification, and how next-generation tools are using cloud-native and AI-powered approaches to innovate in the space. You'll also learn about its relationship with data security posture management (DSPM) and how it relates to the trend toward zero-trust security practices.

The History of Data Classification

Data classification has a long history, starting with government and military data schemes using labels like confidential, secret, and top secret, to control access to critical information. In the late 1970s and 1980s, as computers became popular, the need to safeguard sensitive data from unauthorized access led to the development of access controls, such as usernames and passwords.

With the rise of the internet and communication platforms in the 1990s, protecting data during transmission became essential, giving rise to encryption protocols like Secure Sockets Layer (SSL). In the early 2000s, government and industry mandates, such as the Health Insurance Portability and Accountability Act (HIPAA), whose Privacy Rule took effect in 2003, and the Payment Card Industry Data Security Standard (PCI DSS), released in 2004, enforced data classification and protection in the healthcare and payment card sectors.

More recently, high-profile data breaches and stringent data privacy regulations like the General Data Protection Regulation (GDPR) have underscored the importance of sensitive data discovery and classification. While the core concept has existed since early computing, its formalization and widespread adoption have evolved to address growing digital complexity and privacy concerns.

The Necessity of Sensitive Data Discovery and Classification

In its simplest form, sensitive data is data that must be protected from unauthorized access.

Sensitive data can be broken down into several types, some of which have already been mentioned.

Personally Identifiable Information

PII is data that can be used to identify a specific individual. Data of this type usually includes Social Security numbers (SSNs); biometrics, like fingerprints or facial scans; or any combination of data that, together, could identify an individual.

Personal Information

Personal information (PI) is a more general classification of data. PI can include PII but can also include other data that is clearly related to a person but does not necessarily identify a person. This classification is much broader and can include data like the following:

  • Location information
  • Photographs
  • Racial origin
  • Criminal record
  • Health or genetic information

Material Nonpublic Information

Material nonpublic information (MNPI) is data relating to a company, including its financial results, holdings, subsidiaries, and any other information that could have an impact on the company's share price.

Because this information can affect a share price, it can be used to gain an unfair advantage when trading stock, a practice that is highly regulated and usually illegal.

Protected Health Information

Protected health information (PHI) is a sensitive data type specifically regulated by HIPAA, which defines eighteen identifiers, including, but not limited to, the following:

  • Names
  • Phone numbers
  • Location information
  • Account numbers
  • Medical record numbers

Other Data Types

There are many other data types not covered in this guide, but as you can see, data classification is important, especially when data falls under a national or international regulation, like the GDPR.

Impact of Cloud Migration on Sensitive Data Discovery and Classification

In modern computing, more and more companies and services are moving their data to the cloud. This transition simplifies the process of scaling your solution, as there's no need to invest in additional hardware. Moreover, cloud hosting providers offer automatic redundancy, reliability, and backup. Disaster recovery can also be automated and integrated into your storage plan.

However, this doesn't necessarily mean that identifying, classifying, and protecting sensitive data is any easier with cloud storage. In a traditional data center model, the company is responsible for security across its entire operating environment, including its applications, physical servers, user controls, and even physical building security. In a cloud environment, the cloud service provider (CSP) offers valuable relief by taking on a share of many operational burdens, including security. To clarify how responsibilities are split, CSPs introduced the concept of the shared responsibility model. This model establishes the responsibilities that fall to the CSP and to the company's security team as applications, data, containers, and workloads are moved to the cloud. Defining the line between your responsibilities and those of the CSP is imperative for reducing the risk of introducing vulnerabilities into your public, hybrid, and multi-cloud environments.

The average business manages 10 or more cloud environments today, across Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS) deployment models. One common factor across these cloud environments is that the responsibility to secure data falls to the business, not the CSP. This highlights a key complexity for security teams as the businesses they support migrate data to the cloud. The permissive nature of the cloud, especially in SaaS environments, makes it easier for data to proliferate and be shared, and more challenging for IT and security teams to maintain visibility and control over that data.

Historically, tools that bundled data discovery and classification capabilities relied on human interaction to enable them. In order to discover a datastore, tools like data catalogs, information management systems, and data loss prevention (DLP) tools require humans to manually connect the tool to the datastore. This is typically achieved using a JDBC or ODBC connection, an API, or a network proxy that inspects traffic going to and from a datastore. This means that the people implementing and administering these systems must know that a datastore exists, where it is located, and how to connect the tool to it.
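
To make this manual step concrete, here is a minimal sketch in Python of connecting to a single datastore over ODBC, assuming a hypothetical SQL Server instance and the pyodbc library; every connection detail shown is a placeholder that an administrator would already have to know.

```python
# Manual datastore discovery: an administrator must supply the driver, host,
# database, and credentials before any scanning can happen. All values here
# are illustrative placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=crm-db.internal.example.com;"
    "DATABASE=crm;"
    "UID=scanner;PWD=example-password"
)

# Enumerate the tables so a discovery tool knows what it can scan.
cursor = conn.cursor()
for row in cursor.tables(tableType="TABLE"):
    print(row.table_schem, row.table_name)
conn.close()
```

Multiply this by every datastore in the business, and the scale of the manual effort becomes clear.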

Similarly, for classification, humans bear a significant up-front burden in establishing the metadata and tagging required for a classification tool to be effective. They must define metadata, including Microsoft Information Protection (MIP) sensitivity labels in Microsoft 365 environments, and manually create classifiers that define the detection mechanism for each data class. The latter requires regular expressions (RegEx), sample data, and sample objects so that the tool can match the provided pattern against data in the connected environment. A large number of companies still maintain their data inventories manually using these methods, and they're being held back by the lack of automation in their data discovery tools.
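
As a rough illustration of what "manually creating classifiers" involves, the following Python sketch defines two RegEx-based classifiers; the patterns and class names are hypothetical, and production classifiers would need far more validation (such as Luhn checks for card numbers) than shown here.

```python
import re

# Hand-written classifiers: each data class is defined by a pattern that the
# tool matches against sampled data. Deliberately simplified for illustration.
CLASSIFIERS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}

def classify(text):
    """Return the names of the data classes whose patterns match the text."""
    return {name for name, pattern in CLASSIFIERS.items() if pattern.search(text)}

print(classify("Customer SSN: 123-45-6789"))  # {'ssn'}
```

Writing and tuning dozens of such patterns by hand is exactly the maintenance burden that automated tools aim to remove.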

Most Tools Require Manual Data Discovery

Today, modern, cloud-native tools are implementing automated processes to keep pace with the way businesses create, consume, and use data. Historically, administrators had to manually develop the skills to discover and organize data in different datastores, an incredibly time-intensive effort that was more than likely performed on top of an employee's existing responsibilities.

Manual processes have contributed to an astonishing statistic: 74 percent of security decision-makers estimate that their organization's sensitive data was breached at least once in 2022. In a recent study that Cyera commissioned with Forrester Consulting, 59 percent of security leaders admitted they struggle to maintain a detailed data inventory. Manual data discovery and classification tend to be error-prone, and individual employees need extensive institutional knowledge to perform the function at an acceptable level.

There are several additional complexities you need to take into consideration, including the following:

  • Data location and residency: Some regulations (like GDPR) specifically regulate where data can be stored, especially the data of European Union (EU) residents. With cloud storage, you might not even know which data centers your client or customer data is located in (a minimal residency check is sketched after this list).
  • Data encryption: While cloud storage does offer encryption, ensuring a consistent encryption policy across all your different data types can be difficult.
  • Integration with data discovery tools: More than likely, extra configuration and adaptation are required if you want to integrate your data discovery tools with your cloud storage.
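
As an example of the first complexity, here is a minimal residency check, assuming AWS S3 and the boto3 SDK; credentials are taken from the environment, and the EU-prefix check is a simplification of real residency rules.

```python
import boto3

# Flag buckets that live outside EU regions, as a first step toward
# verifying data residency. get_bucket_location returns None for the
# default region (us-east-1), so the value is normalized first.
s3 = boto3.client("s3")

for bucket in s3.list_buckets()["Buckets"]:
    location = s3.get_bucket_location(Bucket=bucket["Name"])
    region = location["LocationConstraint"] or "us-east-1"
    if not region.startswith("eu-"):
        print(f"{bucket['Name']} is in {region}; review for EU residency rules")
```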

In general, cloud migration makes the engineering side of data storage easier, but it makes data security exponentially more complex. It's harder to locate (both geographically and computationally) and to secure the different types of sensitive information spread across your entire organization. In addition, static classifiers at best define an individual data class; they cannot identify the role, region, identifiability, or security posture that provides critical context about the data, which has historically added complexity and manual processing before classifications become actionable for security and privacy teams.

Role of Data Discovery and Classification in Security and Compliance

The different data types also highlight the need for data discovery and classification, especially as they relate to your security posture and regulatory compliance.

There's an emerging security trend called DSPM that aims to answer a few questions about your data and its security, including the following:

  • Where is my sensitive data located?
  • What sensitive data is at risk?
  • What can be done to mitigate or remediate that risk?

Sensitive data discovery and classification form a core part of your DSPM strategy.

If your organization handles any sort of sensitive data, a DSPM strategy is important, and data discovery and classification tools, like Cyera, are a key component of it.

Real-World Use Cases for Sensitive Data Discovery and Classification

There are many use cases for sensitive data discovery in the real world. A few common ones are discussed in the following sections.

Compliance

Your data discovery tools need to recognize that different types of data must comply with different regulations and security standards. If you're dealing with HIPAA-regulated data or doing business in the EU, your data discovery solution needs to ensure that your data handling adheres to the requirements laid out by these regulations.

Some jurisdictions and countries, like the EU and the Philippines, give individuals more control over their own personal data. Laws and guidelines published in these areas give data subjects the power to exercise their "right to be forgotten," at least to some degree.

Under the GDPR, specifically, data subjects also have the "right to be informed," which they can use to ask any third party what personal data it stores about them and where that data is located.

A good data discovery tool should be aware of these standards and rights and should try to discover and classify any found data accordingly.
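
A simple way to picture this is a mapping from detected data classes to the regulations that commonly govern them. The class names and the mapping below are illustrative only; a real tool would maintain this per jurisdiction and keep it current.

```python
# Hypothetical mapping from data classes to the regulations that govern them.
REGULATIONS_BY_CLASS = {
    "phi": {"HIPAA"},
    "pii_eu_resident": {"GDPR"},
    "payment_card": {"PCI DSS"},
}

def applicable_regulations(detected_classes):
    """Union of the regulations implicated by the classes found in a datastore."""
    regulations = set()
    for data_class in detected_classes:
        regulations |= REGULATIONS_BY_CLASS.get(data_class, set())
    return regulations

print(applicable_regulations({"phi", "payment_card"}))  # {'HIPAA', 'PCI DSS'}
```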

Mergers and Acquisitions

Buying another company or merging with one can bring all sorts of complexities to your DSPM. You have no guarantee that the company you're looking to acquire has been following regulatory requirements.

A data discovery and classification tool is essential to assess the security posture of the company you're looking to acquire or merge with.

Beyond security, you'll likely end up inheriting the other company's data set, including any sensitive information they might have regarding their customers or partners.

The process of discovering and classifying this data is essential, not only for integrating it into your company's databases but also for identifying any gaps in terms of risk.

Incident Response

In the event of a data breach, part of the incident response is to identify and classify the types of data that were leaked in the breach.

This process dictates how you need to respond to every facet of the breach, including breach disclosure requirements and communication with your clients and/or business partners.
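
Building on the hypothetical regulation mapping from the compliance section above, a breach-triage step might look like the following sketch; the notification rules shown are simplified summaries, not legal guidance.

```python
# Simplified disclosure obligations keyed by regulation. Real obligations
# depend on jurisdiction, severity, and the specific data involved.
NOTIFICATION_RULES = {
    "GDPR": "notify the supervisory authority within 72 hours",
    "HIPAA": "notify affected individuals and HHS per the Breach Notification Rule",
}

def disclosure_actions(regulations):
    """Map the regulations implicated by a breach to required notifications."""
    return [NOTIFICATION_RULES[r] for r in regulations if r in NOTIFICATION_RULES]

# e.g., disclosure_actions(applicable_regulations({"phi"}))
# -> ['notify affected individuals and HHS per the Breach Notification Rule']
```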

Other Approaches to Data Discovery and Classification

In a large organization, there are different strategies you can use to locate and classify sensitive data. Each approach comes with its own pros and cons.

Siloed Approach

Using a siloed approach, each department is responsible for identifying, locating, and managing its own sensitive data.

This is considered a decentralized approach, and it has a few benefits:

  • Individual teams understand their own data better than a central team trying to understand everyone's data.
  • Teams can customize the tools they use, tailoring them to the specific data types they handle.

However, there are downsides, too. For instance, silos can hinder collaboration between departments and might not adhere to company-wide best practices. Additionally, it becomes increasingly likely that your teams are duplicating work efforts that could be more efficiently managed by a dedicated department. Perhaps most concerning, however, is the fact that siloed visibility and data management mask data drift, data proliferation through shadow and copy data, overly permissive access, and data misuse. In all of these cases, as data moves across an organization, it traverses silos of visibility and management, which makes it increasingly likely that misconfigurations, misuse, and malicious activities will go undetected. This in turn increases the likelihood of a breach.

Hub-and-Spoke Approach

By implementing a hub-and-spoke approach, the responsibility of discovery, classification, and management of your sensitive data falls on the shoulders of a central team dedicated to this function.

Again, this approach has its pros and cons. From an oversight perspective, it's easier for a central team to make sure that all data is covered under company-wide policies regarding data classification and security. Additionally, a centralized team can more easily create a standardized method and/or criteria for classification efforts. It's also more efficient since there's barely any risk of other teams doing the same kind of work for the same overlapping data sets.

However, if a centralized team does not have enough resources, it could become a bottleneck for onboarding or classifying new data sources, especially if your organization is large and complex. Moreover, a centralized team can only enforce what the company empowers it to enforce. If official policy does not give the team the power to enforce its classification policies within other departments, it might be ignored or seen as an inconvenience.

The Future of Sensitive Data Discovery and Classification

While DSPM is a relatively new and emerging trend, it's quite clear that the industry needs it going forward.

There are already data security platforms, like Cyera, that implement machine learning algorithms to learn about the specific data types in a customer's environment. Their software can also connect to an organization's cloud infrastructure using a single IAM role, which enables continuous, agentless scanning of your data residing in the cloud. This is an especially important factor as more and more organizations are moving their data into the cloud. 
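
As a rough illustration of how role-based, agentless access works in general (this is not Cyera's actual implementation), a scanner can assume a single cross-account IAM role and enumerate datastores with the temporary credentials it receives; the role ARN below is a placeholder.

```python
import boto3

# Assume one cross-account role; the temporary credentials it returns are
# all the scanner needs, so no agent runs inside the customer environment.
sts = boto3.client("sts")
credentials = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/DataDiscoveryScanner",  # placeholder
    RoleSessionName="discovery-scan",
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
print([bucket["Name"] for bucket in s3.list_buckets()["Buckets"]])
```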

Conclusion

Sensitive data discovery and classification are important processes that help you identify what sensitive data is in your environment, which in turn informs your data security strategy. They're also an integral part of the DSPM framework, which helps you identify and mitigate the risks associated with any sensitive data you might be managing. Security leaders expect the most transformational benefits to come from improving data security through intelligent automation. To achieve this, they are investing in real-time exposure detection and data security posture management.

This shift promises to improve security policy automation and orchestration, with demonstrable impacts in:

Decreased Time to Value

78 percent of security leaders say that accelerating time to value for their data security solutions is critical or very important. Cyera is implemented with a single IAM role that enables dynamic datastore discovery across deployment models. That means it continuously detects new and changed datastores without human involvement, which keeps pace with the rapid rate of change in cloud environments.

Improved Classification and Detection Accuracy

74 percent of security leaders are investing in automatic data inventory creation and maintenance, and 71 percent are prioritizing data classification accuracy improvements. Cyera’s AI-powered Data Security Platform makes classification fully autonomous, using ML and AI to achieve over 95 percent accuracy without human intervention. 

Enabling Dynamic Security Controls

81 percent of security leaders want to enable dynamic security controls. To ensure security teams can implement the right controls with confidence, Cyera uses large language models (LLMs) to detect named entities and extract topics from environments, deriving deep context about the data, including the role, region, identifiability, and security of the data, to inform specific, fit-for-purpose controls.

See how Cyera’s AI-Powered Data Security Platform applies these capabilities to all of a business's data everywhere.

If you'd like to learn more about data security posture management, check out this glossary for more information.

Author: Thinus Swart