Data quality is an important part of any big data project. Recently, we have seen cases where bad data determined the outcome of major projects; one example is Amazon's AI recruiting tool, which showed bias against women. The underlying issue is bad data, because the data shapes the final result: the model is built from the data, and our hypotheses are drawn from that model. If the data is wrong, everything downstream will be wrong too.
Now, data cleaning is not an easy task. Developers spend over 50% of their time cleaning and organizing data, and that is not counting the time spent collecting viable data sets in the first place. We also have to ensure that data is accessible and available when needed. With all of the privacy concerns today, organizations have to ensure that their data handling meets regulatory requirements (e.g., the GDPR), especially for EU businesses.
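To make "cleaning and organizing" concrete, here is a minimal sketch of a few routine steps: trimming whitespace, dropping unusable records, and deduplicating. The field names (`name`, `age`) and the rules themselves are hypothetical, just to illustrate the kind of work that eats up that 50% of developer time.

```python
def clean(rows):
    """Deduplicate rows, strip whitespace, and drop records with no usable name."""
    seen = set()
    cleaned = []
    for row in rows:
        name = (row.get("name") or "").strip()
        if not name:
            continue  # drop records missing a name entirely
        key = (name.lower(), row.get("age"))
        if key in seen:
            continue  # drop duplicates (case-insensitive on name)
        seen.add(key)
        cleaned.append({"name": name, "age": row.get("age")})
    return cleaned

raw = [
    {"name": " Ada ", "age": 36},
    {"name": "ada", "age": 36},     # duplicate after normalization
    {"name": "", "age": 50},        # unusable record
    {"name": "Grace", "age": None}, # kept; missing age handled downstream
]
print(clean(raw))  # two records survive: Ada and Grace
```

Even this toy version shows why the work is slow: every rule (what counts as a duplicate, which missing values are tolerable) is a judgment call about the data, not just code.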
We probably all heard about data consistency and integrity in a Database 101 class, and I think these, too, sometimes affect the results of our analysis. Times change, and one might say the reason some AI systems have "failed" or appear biased is that their data sets have not changed with the times we are in. I would argue that is more a matter of the data's integrity than its consistency.
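As a reminder of what those Database 101 terms mean in practice, here is a hypothetical integrity check: every order must reference a customer that actually exists. The table names and fields (`customers`, `orders`, `customer_id`) are made up for illustration.

```python
# Two toy "tables": a customer lookup and a list of orders.
customers = {1: "Ada", 2: "Grace"}
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # dangling reference: no customer 3
]

def find_orphan_orders(orders, customers):
    """Return orders whose customer_id has no matching customer record."""
    return [o for o in orders if o["customer_id"] not in customers]

print(find_orphan_orders(orders, customers))  # flags the dangling order
```

A relational database would enforce this with a foreign-key constraint; the point is that data which silently drifts out of step with reality fails this kind of check long before it fails a model.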
What if the data were "perfect" and the result was still "wrong" or "unacceptable"? Could human bias be to blame? Or are we using the right data in the wrong scenario, one different from what the original model was built for?