Embedding Data Quality in Data Strategy & Design for AI
3AI January 5, 2024
Author: Prabhu Chandrasekaran
AI has been around for well over a decade, and generative AI (Gen AI) is now pushing into new frontiers across industries and sections of society. One thing is clearly emerging: the world is no longer the same, and “data” is not merely the new oil but a “strategic asset”. Like any asset, its value compounds or decays depending on its quality. As AI engulfs organizations at large, the data fed to train models becomes pertinent, and so does its quality. This paper reiterates the importance of data quality and outlines a hybrid approach to data quality that itself makes use of AI.
As we say in the transformation world, do not automate a defective process, because automation compounds the defects by leaps and bounds. The same holds for AI. We are at a point where organizations must put data quality on their agenda and have it complement their AI priorities. Organizations rich in quality data have an edge and will lead their sectors through the adoption of emerging technology and innovation.
Most organizations experience the issues and challenges that arise from poor data quality. According to Gartner, poor data quality costs organizations an average of $12.9 million a year and hinders their ability to prioritise and allocate resources to the right business imperatives. Poor data quality also leads to poor customer experiences and missed opportunities to leverage data as a strategic asset. In a way, this is a chicken-and-egg story.
As AI increasingly turns from a data consumer (a model trained on data) into a data generator through applications, interactions, and transactions, the problem will grow rapidly in size and scale. It becomes even more critical because AI is driven by data: the quality of the data determines the effectiveness of the AI. Which precedes which is an interesting debate for another day. There will also be actors in the system who skew the data to drive outcomes in their favour, a facet that has to be addressed systematically.
“Data quality is an investment, not an expense.” – Thomas C. Redman, Author of “Data Driven Decisions: The Importance of Data and Analytics in Business”
In a nutshell, data quality for AI is an issue caused by default design and synthetic influence. It is an area where organizations battle problems such as the inability to take timely decisions, along with issues of bias and ethics. Experiments have shown that AI bias can skew outcomes significantly, contrary to the vision of the organization or of society at large. The images Midjourney generated from prompts for different professions, discussed in “What Can AI-Generated Images Tell Us About Society’s Biases?”, are a good example of how existing data is biased and non-inclusive, and they make the case for prioritising data quality more compelling.
Data quality issues can be addressed through the broadly accepted dimensions of completeness, accuracy, consistency, validity, uniqueness, and referential integrity, and through frameworks such as AIMQ, CDQ, COLDQ, DQA, DQAF, HDQM, HIQM, OODA DQ, TBDQ, and TDQM. However, in many organizations these have remained theory, and most have had limited success in embedding the frameworks. Moving away from “classical” data quality management, let’s see how an integrated approach with AI can help drive the data quality agenda. Data quality management needs a hybrid, integrated approach that leverages an appropriate data strategy together with AI technology.
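To make a few of these dimensions concrete, here is a minimal sketch of how completeness, uniqueness, and validity might be scored for a small set of records. The records, field names, and the email pattern are hypothetical and chosen only for illustration; they are not taken from any of the frameworks listed above.

```python
import re

# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "country": "IN"},
    {"id": 2, "email": None, "country": "IN"},
    {"id": 2, "email": "b@example.com", "country": "US"},
    {"id": 4, "email": "not-an-email", "country": "US"},
]

# Deliberately simple format check (an assumption, not a full email validator).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows, field):
    """Share of distinct values for a field that is meant to act as a key."""
    values = [r.get(field) for r in rows]
    return len(set(values)) / len(values)


def validity(rows, field, pattern):
    """Share of non-empty values that match the expected format."""
    present = [r[field] for r in rows if r.get(field) not in (None, "")]
    if not present:
        return 0.0
    return sum(1 for v in present if pattern.match(v)) / len(present)


scores = {
    "completeness(email)": completeness(records, "email"),
    "uniqueness(id)": uniqueness(records, "id"),
    "validity(email)": validity(records, "email", EMAIL_PATTERN),
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Scores like these only become useful once the data strategy sets a threshold for each attribute, so that a failing score triggers a defined improvement action.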
Data Strategy & Design: Data is a strategic asset, and building a quality data asset requires a comprehensive data strategy and design. A typical Data Asset Life Cycle (DALC) defines six phases: Create, Store, Use, Share, Retain, and Destruct. Organisations need a strategy and design for each phase. Today we see a disproportionate focus on data usage and sharing, while the remaining phases are left to the technology strategy, which focuses on the use of technology rather than on data as a key outcome. Data strategy and IT (technology) strategy therefore need to be developed in parallel, each informing the other.
Key considerations during the strategy and design phase are the contextualisation of data, the kind of information system (technology), and the type of business and its requirements. The approach should broadly cover the definition of relevant data quality attributes, data quality evaluation criteria, and methods for improving and enriching data quality on an ongoing basis.
Once a data strategy and design are in place, AI can be leveraged to improve data quality and minimize the downsides for an organization through interventions across the Data Asset Life Cycle (DALC). This could be in areas such as:
1. Gen AI can be used effectively in data collection, through systematic interaction and by connecting multiple sources of information, to ensure that data is of high quality at source. Example: an interactive questionnaire with real-time validation against connected sources and searches.
Siemens is using generative AI to clean and augment data for its industrial machinery. This helps Siemens to improve the accuracy of its predictive maintenance models, which can help to prevent machine failures and improve uptime. Siemens Energy is using AI to check for missing data in its industrial equipment data. This helps Siemens Energy to improve the reliability of its equipment and prevent unplanned downtime. Siemens Energy is also using AI to add missing information to its industrial equipment data, such as sensor readings and performance metrics.
2. AI can be leveraged for data profiling and cleansing, examining raw data to identify issues and anomalies. Traditionally, cleaning dirty data takes several days and many person-hours and is prone to error. Having AI and ML perform profiling and cleansing can make data robust and reliable at source. Example: checking for duplicates, invalid data, and formatting errors, and fixing them with appropriate corrections.
Walmart is using generative AI to label images of its products. This helps Walmart automate product categorization and tagging, which can save time and improve accuracy.
3. AI can be used to run validation checks against the predefined rules and quality standards set out in the data strategy, and gaps can be addressed with a suitable ML-based imputation technique. Example: checking for missing data and adding the necessary information on the basis of an AI search, or filling it in through imputation methods.
Amazon is using AI to check for missing data in its product catalog. This helps Amazon to ensure that its customers have access to accurate and up-to-date information about its products. Amazon is also using AI to add missing information to its product catalog, such as product reviews and images.
4. Synthetic data can be generated with precision. Different AI approaches apply machine learning and statistical techniques to detect anomalies and replace them with reliable data, which helps transform poor-quality, decaying data into a high-value asset. Example: AI and ML approaches can generate data algorithmically on the basis of real-world scenarios; that data can be used to validate data models and to train machine learning models when real data is not available.
Target is using AI and ML to generate synthetic data for its customer segmentation. This data is used to train machine learning models that can segment customers into different groups based on their demographics, purchase history, and other factors. This helps Target to target its marketing campaigns more effectively.
Target uses data to improve its customer experience. The company’s data strategy identifies the need to collect data about customer preferences, shopping habits, and demographics. The IT strategy defines the use of a cloud-based data warehouse to store and manage this data. The data warehouse is then used to build machine learning models that can predict customer demand, personalize product recommendations, and prevent fraud.
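The cleansing, validation, and imputation steps in points 2 and 3 can be sketched in a few lines. This is a simplified illustration under stated assumptions: the product rows, the SKU format, and the "price must be positive" rule are hypothetical, and median imputation is a plain statistical stand-in for the ML-based imputation the text refers to. It does not describe how any of the companies named above implement these steps.

```python
from statistics import median

# Hypothetical product rows; names, prices, and rules are illustrative only.
raw_rows = [
    {"sku": " ab-101 ", "name": "Kettle", "price": 25.0},
    {"sku": "AB-101", "name": "Kettle", "price": 25.0},   # duplicate once normalized
    {"sku": "ab-102", "name": "Toaster", "price": None},  # missing price
    {"sku": "ab-103", "name": "Blender", "price": -5.0},  # violates the price rule
    {"sku": "ab-104", "name": "Mixer", "price": 40.0},
]


def normalize(row):
    """Fix formatting: trim whitespace and upper-case the SKU."""
    cleaned = dict(row)
    cleaned["sku"] = row["sku"].strip().upper()
    return cleaned


def deduplicate(rows, key):
    """Keep the first row seen for each key value."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def is_valid_price(value):
    """Predefined rule (assumed): price must be present and positive."""
    return value is not None and value > 0


def impute_prices(rows):
    """Replace missing or rule-violating prices with the median valid price."""
    valid = [r["price"] for r in rows if is_valid_price(r["price"])]
    fill = median(valid)
    result = []
    for row in rows:
        fixed = dict(row)
        if not is_valid_price(row["price"]):
            fixed["price"] = fill
            fixed["price_imputed"] = True  # keep an audit trail of the change
        result.append(fixed)
    return result


clean_rows = impute_prices(deduplicate([normalize(r) for r in raw_rows], "sku"))
for row in clean_rows:
    print(row)
```

The `price_imputed` flag is a design choice worth keeping in any real pipeline: downstream users can then tell measured values from filled-in ones, and a human reviewer can revisit them.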
Framework for Leveraging AI for Data Quality in the Data Strategy Stages
The framework below outlines the stages of a data strategy, the key outcome of each stage, and the AI integrations that can be explored to improve data quality.
| Data Strategy Stage | Key Outcome | AI Integrations that can be explored |
| --- | --- | --- |
| Data Acquisition and Ingestion | Identify necessary data sources and types | Utilize AI for data cleansing and validation during ingestion; implement data acquisition methods like scraping or APIs |
| Data Storage and Management | Establish secure and scalable data storage infrastructure | Utilize AI for metadata tagging and cataloging; enforce data quality and security standards using AI-driven tools |
| Data Processing and Analysis | Convert raw data into a form that is fit for use, and enable the extraction of meaningful insights, patterns, trends, and relationships | Apply AI for preprocessing, feature engineering, and dimensionality reduction; utilize machine learning and deep learning models for analysis; automate data pipelines for real-time or batch processing |
| Data Visualization and Interpretation | Transform data into a visual format for effective communication of insights and understanding | Adopt AI-powered analytics tools for generating insights and predictions; leverage AI to develop user-friendly interfaces for non-technical stakeholders |
| Data Monitoring and Maintenance | Track and maintain the quality and integrity of data, and align with data distribution, archival, and disposal compliance policies | Implement AI-driven monitoring for anomalies and data quality issues; set up automated data maintenance routines, including archiving and purging; train and retrain AI models to adapt to evolving data |
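As an illustration of the monitoring stage, the sketch below flags anomalous values in a data quality metric using a modified z-score based on the median and the median absolute deviation (MAD). It is a simple statistical baseline rather than an AI model, and the metric (daily row counts from an ingestion job), the sample values, and the threshold of 3.5 are assumptions made for the example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, a common
    convention that is robust to the outliers it is trying to find.
    """
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, d in enumerate(deviations) if 0.6745 * d / mad > threshold]


# Hypothetical daily row counts from an ingestion job.
daily_row_counts = [1020, 998, 1005, 1011, 990, 1003, 240, 1008]
print(mad_outliers(daily_row_counts))  # flags index 6, the day with 240 rows
```

A check like this could run after each load and route flagged days to a human reviewer, with a learned model swapped in later if the simple baseline proves too coarse.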
In summary, Gen AI has taken the world by storm. McKinsey’s study of the economic potential of generative AI indicates that it could generate $2.6 trillion to $4.4 trillion in value annually across industries. However, realising an opportunity of this scale requires a calibrated data strategy and design, and a use of AI that prioritises data quality, in order to maximise return on investment.
A Gartner report also indicates that data decay in organizations ranges from 2.5% to 5.8% per month, which is a whopping 30% to 70% per annum. The half-life of data is shrinking as the volume of new data explodes. Organizations, at whatever stage they are, therefore have to prioritise data quality standards, align data strategy with technology strategy, and adopt suitable frameworks so that appropriate data quality standards are built in at design and definition. This will aid effective adoption of AI for better RoI and avoid the downside of AI models producing algorithmic bias from poor-quality data, which leads to suboptimal or negative returns.
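The annual figures quoted above follow from multiplying the monthly rates by 12. As a hedged aside, the sketch below compares that simple annualization with the result if decay is assumed to compound month over month, and derives the implied half-life. The compounding model is an assumption made here for illustration; the source figures do not say which model applies.

```python
import math


def annual_decay_simple(monthly_rate):
    """Annualize by multiplying the monthly rate by 12 (no compounding)."""
    return monthly_rate * 12


def annual_decay_compound(monthly_rate):
    """Share of data decayed after 12 months if decay compounds monthly."""
    return 1 - (1 - monthly_rate) ** 12


def half_life_months(monthly_rate):
    """Months until half of the data has decayed under compounding decay."""
    return math.log(0.5) / math.log(1 - monthly_rate)


for rate in (0.025, 0.058):
    print(
        f"monthly {rate:.1%}: "
        f"simple {annual_decay_simple(rate):.1%}, "
        f"compound {annual_decay_compound(rate):.1%}, "
        f"half-life {half_life_months(rate):.1f} months"
    )
```

Under the compounding assumption the annual decay comes out lower than the simple figures, at roughly 26% to 51%, with a half-life of about 27 months at the low rate and under a year at the high rate. Either way, the conclusion that data loses value quickly without upkeep still holds.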
Of course, irrespective of how much AI is leveraged, iterative refinement and a human in the loop will help ensure better quality, handle complex cases, provide domain knowledge, and refine results. AI can certainly reduce effort, time, and cost and enable businesses to perform at scale when used in tandem with the right data quality framework.
Prabhu is a senior leader in the data and analytics industry, bringing more than two decades of experience in business and digital transformation. A passionate ESG analytics and reporting professional, he inspires organisations and leaders on data strategy and business insights, and works on building effective processes and governance for analytics functions to enable compliance and execution excellence.
Title picture: freepik.com