After nearly two years of experimentation, many organizations are ready to scale up their generative AI (gen AI) projects. However, before doing so, IT leaders must rethink their approach to data management. According to industry experts, successful implementation of gen AI relies heavily on how data is collected, managed, and governed. This article explores three key aspects of data management that can significantly impact the success of gen AI projects: data collection and categorization, governance and compliance, and data privacy and intellectual property (IP) protection.
For many organizations, effectively managing data for gen AI projects involves a rigorous process of collecting, filtering, and categorizing data. This process is especially important for knowledge management (KM) and retrieval-augmented generation (RAG), two of the most common generative AI use cases today.
In KM use cases, enterprises collect and categorize vast amounts of internal data, which can then be fed into AI models to improve efficiency. RAG, on the other hand, lets a model draw on large document collections at query time: the data is chunked, vectorized, and indexed so that the passages most relevant to a user's question can be retrieved and supplied to the model as context.
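To make that retrieval step concrete, here is a minimal Python sketch of the vectorize-index-query loop. The hash-based `embed` function is a toy stand-in for a real embedding model, and the document chunks are invented examples; a production RAG system would swap both out, but the flow is the same.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: hash each word into a fixed-size vector.
    A real pipeline would call an embedding model here instead."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# "Index" the document chunks by embedding each one (hypothetical content).
chunks = [
    "Expense reports must be filed within 30 days.",
    "The VPN client is required for all remote access.",
    "New hires receive laptops on their first day.",
]
index = np.stack([embed(c) for c in chunks])

# Retrieve: embed the query, rank chunks by cosine similarity
# (vectors are normalized, so a dot product suffices), and pass
# the best match to the LLM as context.
query = "How long do I have to submit an expense report?"
scores = index @ embed(query)
best = chunks[int(np.argmax(scores))]
print(f"Context for the LLM: {best}")
```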
According to Doug Shannon, a global AI expert, effective data management requires a deep understanding of both structured and unstructured data. While structured data is relatively easy to manage, unstructured data—though more challenging—is where the real value lies. "You need to know what the data is," says Shannon, "because it’s only after you define and categorize it that you can do anything meaningful with it."
Tools such as those provided by Nvidia can help enterprises manage and filter data by removing personally identifiable information (PII) or other sensitive content. These tools also allow for "data blending," a process where data from different sources is merged and balanced to meet specific requirements.
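The sketch below shows what PII scrubbing and data blending can look like in principle. It is not Nvidia's API; the regex patterns are deliberately simplistic stand-ins for the far more robust detectors commercial tools use, and the blend ratio is an assumed parameter.

```python
import random
import re

# Hypothetical patterns; production tools use much stronger PII detectors.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_pii(record: str) -> str:
    """Replace detected PII with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        record = pattern.sub(f"<{label}>", record)
    return record

def blend(source_a: list[str], source_b: list[str], ratio: float, n: int) -> list[str]:
    """Sample n records, drawing from source_a with probability `ratio`,
    to merge and balance two sources to a target mix."""
    return [
        random.choice(source_a) if random.random() < ratio else random.choice(source_b)
        for _ in range(n)
    ]

docs = ["Contact jane.doe@example.com", "SSN on file: 123-45-6789"]
print([scrub_pii(d) for d in docs])  # ['Contact <EMAIL>', 'SSN on file: <SSN>']
```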
Ensuring high-quality data is crucial for AI model accuracy. Filtering out irrelevant or poor-quality data and using version control can improve the signals fed into AI models, increasing the likelihood of successful outcomes. Automating these processes can save significant time and effort, especially as datasets grow larger with the expansion of generative AI.
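As a rough illustration of how such automation might be wired up, the sketch below applies two simple quality heuristics (minimum length and exact-duplicate removal) and tags the result with a content-derived version string so a filtered snapshot can be reproduced. Real pipelines layer on many more filters; the thresholds here are assumptions.

```python
import hashlib
import json

def quality_filter(records: list[str], min_words: int = 5) -> list[str]:
    """Drop exact duplicates and records too short to carry signal."""
    seen, kept = set(), []
    for rec in records:
        text = rec.strip()
        if len(text.split()) < min_words or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept

def dataset_version(records: list[str]) -> str:
    """Content-addressed version tag: the same records always hash
    to the same tag, making filtered datasets reproducible."""
    digest = hashlib.sha256(json.dumps(records).encode())
    return digest.hexdigest()[:12]

raw = [
    "ok",  # too short: filtered out
    "The quarterly report covers revenue by region.",
    "The quarterly report covers revenue by region.",  # duplicate
]
clean = quality_filter(raw)
print(len(clean), dataset_version(clean))  # 1 record, stable version tag
```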
Data governance is another critical aspect of managing data for generative AI projects. As companies adopt gen AI, they must ensure that their governance frameworks are robust enough to handle new challenges related to automation, compliance, and regulatory changes.
Harvard University provides a compelling example of how data governance can evolve to meet the needs of generative AI projects. The university’s IT department developed an "AI Sandbox" for experimentation, offering access to large language models (LLMs) and other AI tools. However, the experimentation required careful consideration of data governance principles, particularly as AI models became more automated and less reliant on traditional structured data models.
Klara Jelinkova, Harvard’s VP and CIO, emphasizes that legacy governance models built around structured data often fall short in the world of generative AI. "We quickly realized that we needed to rethink data governance for automated data pipelines," Jelinkova says, "especially given the complexities of working with unstructured data and the velocity at which AI models operate."
Compliance is another key concern. With evolving global regulations like the EU AI Act, enterprises need to ensure that their generative AI projects meet all legal requirements before scaling up. Harvard's proactive approach includes a dedicated working group that monitors regulatory changes and ensures compliance with emerging standards.
As organizations deepen their reliance on data-driven AI models, protecting data privacy and intellectual property becomes increasingly important. Ensuring that data remains secure and that sensitive information is not compromised is paramount for the success of AI projects.
Data privacy concerns are particularly relevant when dealing with large AI models that require massive amounts of data. Role-based access control (RBAC) is one tool organizations can use to manage who can access which datasets, ensuring that sensitive information is not inadvertently shared with unauthorized users.
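A minimal sketch of that idea, assuming invented role names and dataset tags: each role maps to the dataset tags it may read, and the check runs before any retrieval is performed.

```python
# Map roles to the dataset tags they may read (hypothetical examples).
ROLE_PERMISSIONS = {
    "analyst": {"public", "internal"},
    "hr_admin": {"public", "internal", "pii"},
}

def can_read(role: str, dataset_tag: str) -> bool:
    return dataset_tag in ROLE_PERMISSIONS.get(role, set())

def retrieve(role: str, dataset_tag: str, query: str) -> str:
    """Gate every retrieval behind the RBAC check."""
    if not can_read(role, dataset_tag):
        raise PermissionError(f"role '{role}' may not read '{dataset_tag}' data")
    return f"results for {query!r} from {dataset_tag}"  # placeholder retrieval

print(retrieve("analyst", "internal", "sales by region"))  # allowed
# retrieve("analyst", "pii", "employee records")  # raises PermissionError
```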
Harvard University’s approach to data privacy involves strict guidelines for handling proprietary information, particularly in AI projects where public models may be used. Jelinkova points out that Harvard takes extra measures to ensure that its data is not exposed to commercial exploitation. By negotiating contractual protections with third-party vendors, the university ensures that its valuable data is not used outside of agreed-upon parameters.
Shannon warns that many organizations still lack visibility into how third-party tools handle their data. "Even with assurances from vendors, there’s still a lot of uncertainty about what happens to your data once it enters these AI systems," he says. That makes it essential to put strong privacy protections in place from the start.
As enterprises look to scale their generative AI projects, effective data management will be crucial to their success. By focusing on three key areas—collecting and categorizing high-quality data, ensuring robust governance and compliance, and protecting data privacy and intellectual property—organizations can better navigate the challenges of AI deployment. In an era where data-driven AI models hold the potential to transform industries, getting data management right is the key to unlocking the full value of AI.