Outliers are, by definition, extreme events. They occur infrequently and can be anything from a mild nuisance to a catastrophic disruption. One thing nearly everyone agrees on is that they are a problem in advanced analytics and machine learning. Outliers skew our data, distort the calculation of what is “typical” or “expected”, and can lead to poorly fitted machine learning models.
It’s all too common to see people remove outliers from their data. There is a valid business argument for this: outliers don’t represent typical business, and you wouldn’t want them to negatively impact your SLAs or KPIs. Many introductory machine learning courses and textbooks advocate the removal of outliers.
But outliers aren’t necessarily unexpected, nor truly infrequent. In manufacturing, downtime is a common problem that can be misreported if outliers are ignored. Similarly, production quality will often deteriorate if a manufacturing line is operating too fast, and removing outliers in that direction means the business cannot truly model and optimise its production throughput. Proper handling of outliers is essential for predictive analytics and decision making.
In this presentation, we’ll demonstrate how you can use probabilistic programming to model the true data-generating process, outliers and all. We’ll look at examples including house price prediction, recent “anomalies” due to COVID-19 lockdowns, and manufacturing.
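To illustrate the idea, here is a minimal sketch (not the presenter's actual code): instead of deleting outliers, we model them with a heavy-tailed Student-t likelihood, which treats extreme values as rare-but-plausible events. A full probabilistic-programming framework such as PyMC or Stan would place priors on the parameters; to keep the example self-contained we simply maximise each likelihood with SciPy and compare the resulting estimates of the "typical" value.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(42)

# Simulated process: "typical" observations around 10, plus a few extreme outliers.
data = np.concatenate([rng.normal(10.0, 1.0, 95),
                       np.array([50.0, 55.0, 60.0, 48.0, 52.0])])

def neg_log_lik_normal(params):
    """Negative log-likelihood under a Gaussian model."""
    mu, log_sigma = params
    return -stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)).sum()

def neg_log_lik_student_t(params):
    """Negative log-likelihood under a Student-t model (df=3: heavy tails)."""
    mu, log_sigma = params
    return -stats.t.logpdf(data, df=3, loc=mu, scale=np.exp(log_sigma)).sum()

start = [np.median(data), 0.0]  # robust starting point for the optimiser
mu_normal = optimize.minimize(neg_log_lik_normal, start).x[0]
mu_t = optimize.minimize(neg_log_lik_student_t, start).x[0]

print(f"Gaussian estimate of typical value:  {mu_normal:.2f}")  # pulled toward outliers
print(f"Student-t estimate of typical value: {mu_t:.2f}")       # stays near the bulk
```

The Gaussian fit is dragged upward by the handful of extreme points, while the heavy-tailed model recovers the centre of the typical data without discarding anything, so the extreme events remain available for separate analysis.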
Founder & Data Scientist, Lingo Search & AI
Nick is a senior Data Scientist who has worked alongside a broad spectrum of customers in New Zealand and Australia. He founded Lingo Search & AI to help customers with their digital transformation, harnessing data and predictive analytics to drive better business outcomes. Lingo works with companies at all stages of their data journey, from initial discovery and proofs of concept through to production-level predictive solutions.