
Tell me… how ugly is your bad data?

The adage ‘garbage-in-garbage-out’ is an analytics mantra so ingrained it has its own shorthand: GIGO. Yet, in the mad, blind rush toward all things ‘big data’, there is the danger of sidelining the crucial-but-dreary topic of data quality, to which GIGO refers.

While data quality is not as ‘sexy’ as big data, anyone who wants to work with big data or fancies themselves a data scientist will quickly run smack into a ‘big bad data wall’ without explicit forethought. The discipline of Master Data Management can help quell the pain – knowing the basics can avoid a world of ‘big hurt’!

[Image: Bad Data]

You saw the terrifying movie, now buy the book!

Jurney, R. 2014. Agile Data Science. O'Reilly.


While fast-evolving tools and techniques allow us to massage and manage sloppy data, when the rubber meets the road, bad data at best poses fundamental challenges to an analytics inquiry. At worst, bad data produces misleading insights, which spawn poor, even destructive, decisions. Such perverse results can even remain hidden – decision flaws in-waiting – until disaster strikes.

A key point to assimilate, internalize, and imbibe is that data quality is only partially a technical problem. The scourge of bad data encompasses and often finds its very origins in organizational, as opposed to technical, challenges. At a fundamental level, data quality is thus an organizational challenge: one of governance, aligned incentives, proper processes, and even culture.

Business analytics itself is an organizational process: framing problems that can be addressed with data analysis, which in turn yields insights that drive value-creating decisions.

  • Business analytics (beyond data science) is a process which addresses the coupling of problem framing, data management, and analytics to assure decision quality.
  • Want to know more? Online class on the business analytics process (developed by SARK7 for Accenture Academy)

Bad data thus encompasses situations where poor problem framing (a broken business analytics process) and breakdowns in organizational decision culture perpetuate poor analytics, such as the well-known case of the 1986 Challenger space shuttle disaster.

Excuse me, would you care for a big, steaming heap of bad data?

For those just getting started with analytics, it is often a shock how much time is spent on gathering, cleaning, sorting, and preparing data for analysis. Often there are several rounds of data cleaning as an analytics model evolves, leading to a rinse, wash, spin, repeat cycle.

Many analytics projects follow a classic Pareto-principle 80/20 split between 'data cleansing' and actual analytics (indeed, I have had 95/5 projects). Much of this time involves gathering, combining, re-formatting, sorting, compacting, 'munging' (or wrangling), and attempting to structure and make sense of data which is often in a messy, low-quality condition.
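To make the 'munging' concrete, here is a minimal sketch of typical clean-up steps in Python using the pandas library (the file name and column names are hypothetical, invented purely for illustration):

    import pandas as pd

    # Load a hypothetical raw extract; real projects juggle many such files.
    df = pd.read_csv("raw_customer_extract.csv")

    # Typical munging: normalize column names, trim whitespace, coerce types,
    # drop exact duplicates, and flag rows missing the join key.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    df["customer_name"] = df["customer_name"].str.strip()
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    df = df.drop_duplicates()

    missing_key = df["customer_id"].isna()
    print(f"{missing_key.sum()} rows missing customer_id")
    df = df[~missing_key]

Multiply such steps across dozens of sources and formats, and the 80/20 (or 95/5) split becomes easy to believe.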

But what happens when the data is fundamentally flawed and analysis is thus compromised? Sometimes the mission is hopeless! What happens when there are seven product databases and multiple departments disagree on key aspects such as 'base price'? What happens when a large circle of security databases update each other in an endless, mechanical chain, such that ex-employees keep being re-granted systems access (as happened on one of my past projects, at a company which shall remain nameless)?
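To illustrate the seven-product-databases scenario, here is a hedged sketch in pandas that surfaces SKUs whose 'base price' disagrees across source systems (the sources and figures are invented):

    import pandas as pd

    # Hypothetical extracts from three of the seven product databases.
    sources = {
        "erp": pd.DataFrame({"sku": ["A1", "B2"], "base_price": [10.00, 25.00]}),
        "crm": pd.DataFrame({"sku": ["A1", "B2"], "base_price": [10.00, 27.50]}),
        "web": pd.DataFrame({"sku": ["A1", "B2"], "base_price": [9.99, 25.00]}),
    }

    # Stack the sources, then find SKUs carrying more than one distinct price.
    combined = pd.concat([df.assign(source=name) for name, df in sources.items()])
    conflicts = combined.groupby("sku")["base_price"].nunique()
    print(combined[combined["sku"].isin(conflicts[conflicts > 1].index)])

Code can detect such conflicts, but it cannot resolve them: deciding which price is 'the' base price is an organizational agreement, not a query.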

The truth is, almost all businesses struggle perpetually with fundamental issues of data quality.  Typical businesses have a hodgepodge of multiple data sources (spreadsheets, databases, unstructured documents, etc.) surrounding such key artifacts as 'customer' and 'product'.  These struggles are organizational problems more so than technical problems: breakdowns in data ownership and governance.  Tools can help to improve processes, but basic organizational roles, agreements, and incentives need to be put in place to drive true change.

This is where Master Data Management (MDM) comes in.  MDM is a discipline which focuses on bringing organizational processes, governance, and systems together to improve data quality. A major objective is to establish a 'single version of the truth' in terms of data definitions. Where there are disagreements, for instance based on different professional domains, MDM brokers explicit definitions concerning the distinctions. Tools include metadata dictionaries and/or ontologies – formal descriptions of contextual and conceptual meaning within a domain.
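A very lightweight flavor of such a metadata dictionary might be sketched as follows (the fields, owners, and rules are invented; real MDM tooling is far richer and organizationally governed):

    # A toy 'single version of the truth': brokered definitions per field,
    # each with an accountable owner and a machine-checkable rule.
    data_dictionary = {
        "base_price": {
            "definition": "List price before discounts, taxes, and shipping",
            "owner": "Product Management",
            "unit": "EUR",
            "validate": lambda v: isinstance(v, (int, float)) and v >= 0,
        },
        "customer_id": {
            "definition": "Unique customer key issued by the ERP system",
            "owner": "Data Governance Board",
            "unit": None,
            "validate": lambda v: isinstance(v, str) and len(v) == 10,
        },
    }

    def check(field, value):
        """Validate a value against the brokered definition of a field."""
        entry = data_dictionary[field]
        return entry["validate"](value), entry["definition"]

    print(check("base_price", -5.0))  # (False, ...) – fails the agreed rule

The point is less the code than the agreement behind it: each entry records who owns the definition and what everyone has agreed it means.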

But… I’ll just dump it into a big data store and worry about it later!

One suggested advent of big data is to 'collect all the bad data and clean it later'. While Hadoop and other mass-storage approaches make this increasingly feasible technically, the 'clean later' part does not, as a result, go away. 'Clean later', as in "I'll clean my house / do my homework / pay my taxes next week", runs the danger of never happening – or worse, of dysfunctional data hoarding leaving servers jam-packed with a mess of crud!

The emerging big data processing 'stack' implies that data will be 'cleaned' and presented for analytics as part of a structured process:

[Image: Flow of data processing (Jurney, R. 2014. Agile Data Science. O'Reilly)]

An example technical ‘stack’ here would be (also from Jurney’s Agile Data Science):

Avro -> IMPA -> Hadoop -> Pig -> MongoDB -> lightweight web framework -> D3

This is all great! This is an engineering solution to storing, extracting, transforming, and presenting large sets of data. However, if we wish to perform data analytics, the use of powerful technology does not issue a ‘get out of jail free’ pass.
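To make one stage of that stack concrete, here is a minimal sketch of the Avro serialization step using the fastavro library (a common Python Avro implementation; the event schema is invented for illustration). Note that the schema rejects malformed records at write time – a first, purely mechanical line of defense for data quality:

    from fastavro import writer, reader, parse_schema

    # Hypothetical event schema; Avro enforces it on every record written.
    schema = parse_schema({
        "name": "PageView",
        "type": "record",
        "fields": [
            {"name": "user_id", "type": "string"},
            {"name": "url", "type": "string"},
            {"name": "timestamp", "type": "long"},
        ],
    })

    records = [
        {"user_id": "u-123", "url": "/home", "timestamp": 1700000000},
        {"user_id": "u-456", "url": "/cart", "timestamp": 1700000042},
    ]

    with open("events.avro", "wb") as out:
        writer(out, schema, records)  # a malformed record raises here

    with open("events.avro", "rb") as f:
        for rec in reader(f):
            print(rec)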

The assumption is that somewhere in the 'middle part', magic happens whereby reasonable sense is made of the massive set of data such that there is integrity in the business analytics process. At a minimum, this encompasses a set of organizational procedures (a sketch of the data-validation step follows the list):

  • proper problem framing (strategic and tactical governance; stakeholder alignment),
  • validation of data quality (which assumes a link to MDM),
  • proper data selection and sampling (proper data analytics methods applied),
  • model building (ditto),
  • model testing (ditto),
  • proper interpretation (ditto), and
  • clear communication of results (organizational stakeholder communications).
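
As an illustration of the 'validation of data quality' step above, a minimal sketch might run a battery of MDM-derived checks before any modeling begins (the rules and data are hypothetical):

    import pandas as pd

    def validate(df):
        """Run basic quality checks derived from (hypothetical) MDM rules."""
        problems = []
        if df["customer_id"].duplicated().any():
            problems.append("duplicate customer_id values")
        if (df["base_price"] < 0).any():
            problems.append("negative base_price values")
        if df["signup_date"].isna().mean() > 0.05:
            problems.append("more than 5% of signup_date values missing")
        return problems

    df = pd.DataFrame({
        "customer_id": ["c1", "c2", "c2"],
        "base_price": [10.0, -3.0, 12.0],
        "signup_date": pd.to_datetime(["2024-01-01", None, "2024-02-01"]),
    })
    for p in validate(df):
        print("FAIL:", p)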

In the context of the big data 'stack', such orchestration assumes that technology tools, processes, methods, and organizational stakeholders are aligned. An MDM program and a clear business analytics process assure that quality and risks are formally addressed.

A particularly troublesome challenge concerns properly confronting the methodological issues raised by large sets of data (both large samples and wide ranges of variables). There is a pernicious myth that massive and broad sets of data confer some type of methodological omnipotence. This is not the case: large datasets are particularly subject to issues of model overfitting and variance.

The flip side is that tight / targeted models are susceptible to bias (they work in many cases, but potentially overgeneralize).  These are two sides of the same coin in modeling – ideally a data scientist seeks a sweet spot between the two, but there is always a compromise one way (high variance) or the other (high bias).
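The trade-off is easy to demonstrate with a toy experiment: fit an under-complex and an over-complex model to noisy data and compare training versus test error. A sketch using scikit-learn on synthetic data:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(7)
    X = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
    y = np.sin(2 * X).ravel() + rng.normal(0, 0.25, 60)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for degree in (1, 4, 15):  # high bias, a sweet spot, high variance
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_tr, y_tr)
        print(degree,
              round(mean_squared_error(y_tr, model.predict(X_tr)), 3),
              round(mean_squared_error(y_te, model.predict(X_te)), 3))

The degree-1 model underfits (both errors high: bias); the degree-15 model fits the training noise almost perfectly but degrades on the test set (variance).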

A recent article in Science, 'The Parable of Google Flu: Traps in Big Data Analysis', concerning issues with the Google Flu Trends platform, goes into some detail on this topic. As well, issues of mistaking correlation for causation abound. Big datasets produce multiple models, many of which may involve spurious, context-specific, or phantom correlations.
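The spurious-correlation trap is reproducible in a few lines: scan enough random variables and some will correlate with any target by chance alone. A synthetic sketch (pure noise, no real data):

    import numpy as np

    rng = np.random.default_rng(42)
    n_obs, n_vars = 100, 5000

    target = rng.normal(size=n_obs)                # pure-noise 'outcome'
    candidates = rng.normal(size=(n_vars, n_obs))  # pure-noise 'features'

    # Correlate every candidate with the target and keep the strongest.
    corrs = np.array([np.corrcoef(c, target)[0, 1] for c in candidates])
    best = np.abs(corrs).max()
    print(f"strongest |r| = {best:.2f} across {n_vars} random variables")

With thousands of candidate variables, correlations around |r| ≈ 0.4 routinely appear by chance – exactly the kind of 'phantom' relationship a naive big data trawl will happily report as insight.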

The details of such methodological issues are still being debated. Part of the issue is that machine learning involves a paradigm shift from classical statistical methods. Principles for validating and testing machine learning models are still being developed and socialized. This means extra vigilance is needed when attempting to assert causal conclusions from machine learning-derived insights, especially when relying on computer-built or heavily computer-guided models focused on correlation (as opposed to models rigorously tested for causal indications via traditional tests of statistical significance).

In conclusion, big data is not a panacea. Technology is ineffective without proper processes and organizational application. As well, there are methodological issues associated with large sets of data which must be confronted explicitly.

Do you suffer from bad data? I recommend pursuing an MDM program and implementing an end-to-end business analytics decision process. A Hadoop implementation alone will not lead to effective big data analytics…

  • Want to learn more? A short presentation on methodological challenges associated with Big Data analytics by Scott Mongeau (presented to Erasmus Rotterdam School of Management): https://www.youtube.com/watch?v=UPsJx427rKE

DO YOU HAVE A BAD DATA STORY?  Leave a comment below…

