PClean: The Latest System that Automatically Cleans Dirty Data
Proper cleaning of data has become a top priority for businesses today.
Data cleaning is becoming increasingly important in every field that relies on high-quality data, as the emphasis on data analysis and insights grows. Since poor data quality directly undermines the efficacy of any analysis, getting data cleaning right has become a top priority for businesses today.
According to MIT News, MIT researchers have created a new system that automatically cleans “dirty data” — the typos, duplicates, missing values, misspellings, and inconsistencies dreaded by data analysts, data engineers, and data scientists. The system, called PClean, is the latest in a series of domain-specific probabilistic programming languages written by researchers at the Probabilistic Computing Project that aims to simplify and automate the development of AI applications (others include one for 3D perception via inverse graphics and another for modeling time series and databases).
PClean is said to use a knowledge-based technique to automate the data-cleaning process: users encode contextual knowledge about the database and the kinds of errors that may occur in it.
PClean draws on recent progress in probabilistic programming, including a new AI programming model developed at MIT’s Probabilistic Computing Project, which makes it much easier to apply realistic models of human knowledge to the interpretation of data. PClean’s improvements rest on Bayesian reasoning, an approach that weighs alternative explanations of ambiguous data by assigning them probabilities informed by prior knowledge.
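As a minimal, hypothetical illustration of this kind of Bayesian reasoning (not PClean’s actual model), the sketch below scores candidate corrections for a misspelled city name by combining an assumed prior over true values with a likelihood that decays with edit distance; the city names and prior frequencies are invented for the example:

```python
def edit_distance(a, b):
    # classic one-row dynamic-programming Levenshtein distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def posterior(observed, priors, typo_rate=0.1):
    # P(true | observed) ∝ P(observed | true) * P(true), with a
    # likelihood that decays geometrically in edit distance
    scores = {c: p * typo_rate ** edit_distance(observed, c)
              for c, p in priors.items()}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# hypothetical prior: how often each true value occurs in the database
priors = {"Boston": 0.6, "Austin": 0.3, "Houston": 0.1}
probs = posterior("Bostan", priors)
best = max(probs, key=probs.get)
```

Here the prior pulls toward frequent values while the likelihood pulls toward spellings close to the observation, so a one-letter typo of a common city is resolved to that city rather than to a rarer lookalike.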
The researchers also note that PClean is the first Bayesian data-cleaning system able to combine domain expertise with common-sense reasoning to automatically clean databases of millions of records. PClean achieves this scale through three innovations. First, its scripting language lets users encode what they know, which yields accurate models even for complex databases. Second, its inference algorithm uses a two-phase approach: it processes records one at a time, making informed guesses about how to clean each, then revisits those judgment calls to fix mistakes. This yields robust, accurate inference results. Third, PClean provides a custom compiler that generates fast inference code, allowing it to run on million-record databases faster than several competing approaches.
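The two-phase idea can be sketched as a toy loop. This is an illustrative sketch, not PClean’s algorithm: the candidate list, the matching score, and the records are all invented for the example. Phase one cleans records one at a time against frequencies accumulated so far; phase two revisits each guess once statistics from the whole dataset are available:

```python
from collections import Counter

CANON = ["Boston", "New York", "Chicago"]  # assumed reference values

def best_match(value, candidates, counts):
    # crude proxy for likelihood * prior: prefer candidates that share
    # letters with the observed value and have been seen often
    def score(c):
        overlap = len(set(value.lower()) & set(c.lower()))
        return overlap + counts[c]
    return max(candidates, key=score)

def clean(records):
    counts = Counter()
    cleaned = []
    # Phase 1: greedy per-record guesses, updating counts as we go
    for r in records:
        guess = best_match(r, CANON, counts)
        counts[guess] += 1
        cleaned.append(guess)
    # Phase 2: revisit each guess now that counts reflect all records
    for i, r in enumerate(records):
        counts[cleaned[i]] -= 1          # remove this record's own vote
        revised = best_match(r, CANON, counts)
        counts[revised] += 1
        cleaned[i] = revised
    return cleaned
```

The second pass matters because early records are cleaned with almost no accumulated evidence; revisiting them after the full dataset has been seen lets later evidence correct early mistakes.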
Without the vast investments in staff and software systems that data-centric organizations currently depend on, PClean makes it cheaper and more efficient to merge messy, fragmented databases into clean records. This has potential social benefits, but also drawbacks, such as the possibility that PClean would make it easier and cheaper to violate users’ privacy, and perhaps even to de-anonymize them, by combining stale data from various public sources.