How do you check your data quality? Start by evaluating its Representativity!

Introduction to Data Representativity

When you envision a data-based approach to optimizing the efficiency of your industrial operations, or when you start specifying your first application, questions about Data Quality and Data Representativeness immediately arise.

Based on a simple manufacturing scenario, this post explains what Data Representativity means, what the most common problems are and how they can be tackled.

First things first, let’s start by setting the scene.

Smart Manufacturing Scenario

Our modern manufacturing line is equipped with sensors and IoT devices that continuously collect data. We are measuring and collecting 4 types of data:

Physical Variables (temperature, pressure, humidity, speed, voltage, current, etc.);
Machine Settings such as machine speed, tooling settings, and calibrations;
Sensor Data (accelerometers, pressure sensors or flow meters for example);
And finally Quality Control Inspections and tests such as product dimensions, tolerances or simple defect counts.

In our scenario, we want to learn the relationship between Physical Variables, Machine Settings and Sensor Data (called “Input Features”) on the one hand and the Quality Control Inspections (called “Desired Outcome”) on the other. Our objective is to perform Quality Deviation Root Cause Analysis in near real time.

What is Data Representativity?

Data is collected over a certain period of time. It includes both the input features and the desired outcome. The result is what we call the “Historical Dataset”. Analyzing this dataset lets us model the relationship between input features and desired outcome. Once verified, this model can be deployed for use in real time.

Data representativity in the context of machine learning for IoT (Internet of Things) refers to how well the collected data represents the entire range of possible scenarios and conditions that the IoT system may encounter.

Sounds obvious, doesn’t it? But models trained on biased or unrepresentative data are one of the reasons data-driven projects fail in industry. Let’s dig a little deeper into the most widespread problems regarding Data Representativity.

The 7 most common problems in data representativity

Sampling Bias: If your historical dataset is a sample of all the data you are collecting, make sure that this sample covers all the operating conditions of your manufacturing line.

Imbalanced Data: In some cases, one class or category within the data may be heavily overrepresented compared to others. This can lead to a model that is biased towards the majority class and performs poorly on minority classes. Imbalanced data can be a significant problem in scenarios like fraud detection or rare event prediction in IoT.

Temporal Bias: Data collected during specific time periods may not be representative of conditions across the whole year. Examples in manufacturing include tool changes and variations in raw materials.

Limited Feature Representation: If important process variables are missing from the dataset (for example, because they were never collected), the model may not capture all the factors that influence the target outcome. This can lead to poor generalization.

Outliers and Anomalies: Outliers or anomalies in the data can skew the model’s understanding of normal patterns and cause it to make incorrect predictions or decisions.

Data Drift: Over time, the distribution of data in an IoT system may change due to factors like system degradation or updates. If the model is not adapted to these changes, it may become less representative and less effective.

Human Biases: If data collection involves human input or labeling, human biases can inadvertently influence the data, leading to representational issues.
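
Two of the problems above, imbalanced data and outliers, lend themselves to quick numeric checks. The following is a minimal sketch in Python using only the standard library. The function names, the sample values and the conventional 1.5 x IQR rule are assumptions chosen for illustration, not a prescribed method.

```python
from collections import Counter
from statistics import quantiles


def class_shares(labels):
    """Return the share of each class (e.g. "ok" / "defect") in the labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical quality-control labels: 95 good parts, 5 defective ones.
labels = ["ok"] * 95 + ["defect"] * 5
shares = class_shares(labels)

# Hypothetical temperature readings with one suspicious value.
temperatures = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]
suspects = iqr_outliers(temperatures)

print(shares)
print(suspects)
```

A very small share for one class, or a handful of flagged readings, is not a verdict in itself. It is a prompt to ask a process expert whether the rare cases are real conditions that the model must learn or measurement errors that should be cleaned.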

The importance of external biases

In more specific manufacturing processes, you can also face:

Geographical Bias: Data is collected from one line and the model is deployed on another line, which may differ from the first, even if only slightly.

Seasonal or Weather-Related Bias: If data is primarily collected during certain seasons or weather conditions, the model may not generalize to the others.

User and Behavioral Variability: In applications where user behavior or preferences are relevant, data may not adequately capture the diverse behaviors and preferences of users.
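
A simple way to spot geographical or seasonal gaps is to list which lines or months never appear in the historical dataset. The sketch below assumes each record is a dictionary with hypothetical "line" and "month" fields; a real dataset will have its own schema.

```python
from collections import Counter


def coverage_gaps(records, key, expected):
    """Return the expected values of `key` that never appear in the records."""
    seen = Counter(record[key] for record in records)
    return [value for value in expected if seen[value] == 0]


# Hypothetical historical dataset: only line A, only January to March.
records = [
    {"line": "A", "month": 1},
    {"line": "A", "month": 2},
    {"line": "A", "month": 3},
]

missing_months = coverage_gaps(records, "month", range(1, 13))
missing_lines = coverage_gaps(records, "line", ["A", "B"])

print(missing_months)
print(missing_lines)
```

Here the check would report that months 4 to 12 and line B are absent, so a model trained on this data has never seen summer conditions or the second line.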

In light of this, let me share some tips for ensuring a high level of representativeness in a dataset.

How can we assess the representativeness of a dataset?

Industrial organizations can employ various tools and methodologies to perform this assessment. Here are the three main domains of action.

First of all, Domain Expertise. Industrial experts who are familiar with the manufacturing processes guide the selection of relevant features and data sources. Based on their knowledge, they also provide valuable insights into what constitutes a representative dataset (missing features, temporal bias, etc.).

Data visualization tools (histograms, scatter plots, box plots, and other visualizations) help to reveal data distributions, outliers, and potential issues. Summary statistics (mean, median, standard deviation), correlation analysis, and hypothesis testing can also help quantify data characteristics and identify patterns or anomalies. Comparing different samples of the same overall dataset shows whether they exhibit similar statistical properties, which supports the representativity of the dataset as a whole.
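
The comparison of two samples can be made concrete with the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical cumulative distributions of the samples: 0 means identical distributions, 1 means no overlap at all. The following is a minimal standard-library sketch; the function name and the pressure readings are hypothetical, and it returns only the statistic, not a p-value. For a full hypothesis test, a statistics library such as SciPy (`scipy.stats.ks_2samp`) would be the usual choice.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Return the largest gap between the two empirical CDFs (0.0 to 1.0)."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The gap between two step functions can only change at a data point,
    # so it is enough to evaluate both CDFs at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical pressure readings from two periods of the same dataset.
week_1 = [2.0, 2.1, 2.2, 2.3, 2.4]
week_2 = [2.0, 2.1, 2.2, 2.3, 2.4]
week_9 = [3.0, 3.1, 3.2, 3.3, 3.4]

print(ks_statistic(week_1, week_2))  # identical samples
print(ks_statistic(week_1, week_9))  # samples that do not overlap
```

The same check can be run periodically between the historical dataset and recent production data to watch for data drift. What counts as "too large" a gap depends on the process and the sample size, so any threshold should be agreed with the domain experts.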

The second kind of action relies on Data Team Expertise. Among the many available techniques, the most widely used are Clustering and Outlier Detection, Dimensionality Reduction, Feature Engineering, Data Augmentation and Cross-Validation. As this article is aimed at industrial experts, we will not go into detail here.

Data Governance is the last domain. Implementing data profiling, data lineage tracking, and data quality monitoring helps ensure data representativeness. Strong data governance practices, such as guidelines for data collection, storage, and usage, also help maintain data quality standards.

What else to consider?

In this post, we have seen why ensuring representativity in the training data is crucial for correctly representing real-world scenarios. We have explained the most common problems in data representativeness. We have presented the main actions to ensure the best possible quality of a dataset.

This article aims to provide a first level of understanding to non-data experts. There are surely examples, stories, or insights that don’t fit into any of the previous sections. Please do not hesitate to contact us; we will be delighted to discuss your specific needs.