Leveraging Organizational Data for AI/ML and LLMs: A Strategic Framework

In the previous article, I emphasized that Large Language Models (LLMs) are not merely a technical pursuit but a strategic imperative. In this discussion, I will explore how enterprises can effectively leverage data for AI/ML and LLM use cases. Drawing from my extensive experience as a CISO and now as an AI/ML strategy advisor, I have observed that many organizations lack a holistic approach to data management.

The Data Dilemma

Data is often treated as uniformly sensitive in many organizations; in other words, everything under the sun is deemed important. Yet there is a glaring disconnect: security teams and business units rarely have a clear picture of where data resides or how it is used. Each department operates in a silo, creating, using, and storing data independently, often without standardized processes. The same pattern appears in AI/ML and LLM use cases, where technical teams lack the visibility they need into the organization's data landscape. The question arises: who is responsible for organizing, structuring, and protecting organizational data?

The Need for a Data Steward

There is a pressing need for a dedicated role responsible for managing the organization's data lifecycle, from creation to deletion, in compliance with legal requirements. This role should be distinct from that of the CIO or CISO. The CIO's focus on operational efficiency and cost-cutting can conflict with the need for comprehensive data management; similarly, the CISO's focus on data protection can constrain innovative data usage. Therefore, a new role, ideally a Data Steward, is essential. This individual should possess intimate knowledge of the business and the ability to connect the dots across data creation, usage, and deletion.

A Critical "Step Zero": Understanding the Data Landscape

With the organizational structure established, it's essential to understand the broader data landscape. Organizations should conduct a comprehensive data audit to map out the current state of data across departments. The audit involves the following (a minimal catalog sketch appears after the list):

  • Data Sources: Catalog all data sources, including engineering data, databases, applications, third-party services, and manual data entry points.
  • Data Formats: Recognize the different formats in which data exists, such as structured, semi-structured, and unstructured.
  • Data Flow: Understand how data flows through various processes within the organization, highlighting integration points and potential bottlenecks.
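
To make the audit concrete, here is a minimal sketch of what a catalog entry might capture. The field names, source names, and consumer labels are illustrative assumptions, not a prescribed schema; the point is simply to record ownership, format, location, and downstream flow in one place.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical catalog entry used during the "step zero" data audit.
@dataclass
class DataSourceEntry:
    name: str                      # e.g. "billing-postgres"
    owner_department: str          # department accountable for the source
    data_format: str               # "structured", "semi-structured", or "unstructured"
    storage_location: str          # e.g. "aws-rds/eu-west-1"
    downstream_consumers: List[str] = field(default_factory=list)  # where the data flows next

catalog = [
    DataSourceEntry("billing-postgres", "finance", "structured",
                    "aws-rds/eu-west-1", ["data-warehouse", "ml-feature-store"]),
    DataSourceEntry("support-tickets", "customer-success", "unstructured",
                    "zendesk-export-bucket", ["llm-fine-tuning-corpus"]),
]

# Spot integration points: which sources feed AI/ML or LLM pipelines?
ml_feeds = [entry.name for entry in catalog
            if any("ml" in c or "llm" in c for c in entry.downstream_consumers)]
print(ml_feeds)  # ['billing-postgres', 'support-tickets']
```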

Practical Steps to Unlock Your Data's Full Potential

Step 1: Identify Data Storage Locations

  • The first step is for each department to clearly define its important data. Departments are best positioned to identify the data they produce and use, as well as the conditions under which it should be deleted. This step means determining the 'crown jewels' at the departmental level and pinpointing their storage locations. A hypothetical inventory sketch follows.
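
Below is a minimal sketch of how a department-level inventory could be recorded once each team has named its crown jewels. The departments, assets, locations, and retention rules are purely illustrative assumptions.

```python
# Hypothetical per-department "crown jewels" inventory collected in Step 1.
# Each department declares what it produces, where it lives, and when it may be deleted.
department_inventory = {
    "marketing": [
        {"asset": "campaign_contact_lists", "location": "crm/salesforce",
         "retention": "delete 24 months after last campaign touch"},
    ],
    "engineering": [
        {"asset": "production_telemetry", "location": "s3://telemetry-raw",
         "retention": "aggregate after 90 days, delete raw after 1 year"},
    ],
    "finance": [
        {"asset": "customer_invoices", "location": "erp/netsuite",
         "retention": "retain 7 years for tax compliance"},
    ],
}

# Print a flat view the Data Steward can review with each department.
for dept, assets in department_inventory.items():
    for a in assets:
        print(f"{dept}: {a['asset']} -> {a['location']} ({a['retention']})")
```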

Step 2: Discover and Classify Data 

  • Define clear and simple data classifications, ideally limited to three levels, to ensure usability across the organization; simple schemes are easier to manage and more likely to succeed. Once the critical data, the 'crown jewels,' have been identified and classified, initiate the discovery and tagging process. Selecting the right technology for this step is crucial: many organizations struggle to reach this stage, so getting here is a significant milestone. Data Security Posture Management (DSPM) solutions can automatically tag discovered data, further streamlining the process. Classified and tagged data also simplifies the Data Loss Prevention (DLP) team's task of identifying deviations and exfiltration attempts, a tangible benefit for CISOs. An illustrative tagging sketch follows.
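
The sketch below illustrates a simple three-level scheme applied with regular expressions. Real DSPM tooling does far more than this; the patterns and level names here are assumptions used only to show how a small, usable classification scheme can be expressed and enforced in code.

```python
import re

# Three-level scheme, as recommended: simple enough for every team to apply.
CLASSIFICATION_RULES = [
    ("restricted", [r"\b\d{3}-\d{2}-\d{4}\b",          # SSN-like pattern
                    r"\b(?:\d[ -]*?){13,16}\b"]),      # card-number-like pattern
    ("internal",   [r"(?i)\bconfidential\b", r"(?i)\binternal use only\b"]),
]

def classify(text: str) -> str:
    """Return the highest-sensitivity level whose patterns match, else 'public'."""
    for level, patterns in CLASSIFICATION_RULES:
        if any(re.search(p, text) for p in patterns):
            return level
    return "public"

print(classify("Customer SSN: 123-45-6789"))   # restricted
print(classify("Internal use only: Q3 plan"))  # internal
print(classify("Press release draft"))         # public
```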

Step 3: Centralized Data Visibility Platform

  • Organizations should focus on implementing a centralized platform that provides comprehensive visibility into all their data. This helps reduce risks such as models inadvertently consuming sensitive data they should not have access to, or users reaching AI tools and datasets outside their permissions. For instance, a centralized visibility platform can alert administrators if an unauthorized user attempts to access confidential customer information through an AI tool. A simplified version of that check is sketched below.
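
The sketch below shows the kind of authorization check a centralized visibility layer might run before an AI tool is allowed to read a dataset. The roles, clearance sets, and alerting behavior are assumptions for illustration, not a specific product's API.

```python
# Hypothetical mapping of roles to the classification levels they may read.
ROLE_CLEARANCE = {
    "data-scientist": {"public", "internal"},
    "support-bot":    {"public"},
    "finance-admin":  {"public", "internal", "restricted"},
}

def authorize(role: str, dataset_classification: str) -> bool:
    """Allow the read if the role is cleared for the dataset's level; otherwise alert."""
    allowed = ROLE_CLEARANCE.get(role, set())
    if dataset_classification not in allowed:
        print(f"ALERT: {role} attempted to access {dataset_classification} data")
        return False
    return True

authorize("support-bot", "restricted")     # triggers an alert, returns False
authorize("data-scientist", "internal")    # permitted, returns True
```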

A few years ago, while advising a major telecom provider, we highlighted a significant problem: customer data was stored in different locations according to each department's needs. Marketing, finance, and engineering each had their own data stores, resulting in multiple copies of the same data set. This created a complex ecosystem with massive security and compliance issues. Had they implemented a central platform with comprehensive visibility, the problem would have been far simpler to solve. By preventing data silos, ensuring compliance, and enhancing security, such a platform offers a complete overview of data access and usage, ultimately safeguarding the organization's data assets.

Step 4: Implement Basic Technical Best Practices

  • Normalize and obfuscate data copied from production to development environments (a minimal masking sketch follows this list).
  • Regularly back up data to ensure data integrity and availability.
  • Employ Role-Based Access Control (RBAC) and Two-Factor Authentication (2FA) to enhance data security.
  • Conduct periodic audits and penetration tests to identify system vulnerabilities and unauthorized access.
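
As an illustration of the first practice above, here is a minimal sketch of pseudonymizing identifiable fields before production records are copied into a development environment. The field names and salt handling are assumptions for the example, not a prescribed implementation.

```python
import hashlib
import os

# Fields assumed sensitive in this hypothetical customer record.
SENSITIVE_FIELDS = {"email", "phone", "full_name"}
SALT = os.environ.get("MASKING_SALT", "change-me")  # keep the real salt out of source control

def pseudonymize(record: dict) -> dict:
    """Return a copy of the record with sensitive fields replaced by salted hashes."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS and value is not None:
            digest = hashlib.sha256((SALT + str(value)).encode()).hexdigest()[:12]
            masked[key] = f"masked_{digest}"
        else:
            masked[key] = value
    return masked

prod_record = {"customer_id": 1042, "full_name": "Jane Doe",
               "email": "jane@example.com", "plan": "enterprise"}
print(pseudonymize(prod_record))  # safe to load into the development environment
```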

Starting with this structured approach will help organizations lay a robust foundation for leveraging organizational data in their AI/ML journey. This framework mitigates security risks and minimizes the accumulation of technical debt. 

How would you help organizations unlock the full potential of their data assets, drive innovation, and gain a competitive advantage? Let’s discuss.
