Big Data gets a lot of press these days; some deserving, some not. It hasn’t cured cancer (although it might), but it has dramatically improved logistics, marketing, and a host of other business functions.
As convincing as its wins are, Big Data is not without its detractors. Among the drawbacks most often cited are cost (even though most Big Data software is Open Source and runs on commodity hardware) and the shortage of hirable talent. But according to a study done by Deloitte, those two rank behind “Too Much Data”, which I’ll interpret as “Identifying the Right Data”.
In other words, how do you sort through the chaff to get the data that’s actually useful? And how do you know it’s accurate?
Data Cleansing (the process by which data is selected and its veracity confirmed) can account for up to 80% of a project, according to Senior Principal Consultant Kevin Kline, speaking at a recent seminar at Dell Enterprise Forum. Identifying duplicates, misspellings, and other inaccuracies is hard enough, but there are tools to deal with that (such as the ones from Talend and Trillium).
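To make the duplicate-hunting part of cleansing concrete, here’s a minimal sketch of the idea behind such tools: normalize each record, then flag pairs that are identical or nearly so. The function names, the sample customer list, and the similarity threshold are all my own illustrative choices, not anything from Talend or Trillium; real products use far more sophisticated matching.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())

def likely_duplicates(records, threshold=0.7):
    """Return pairs of records whose normalized forms are similar enough
    to flag for human review. A naive O(n^2) pass -- fine for small sets,
    but real cleansing tools use blocking/indexing to scale."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = normalize(records[i]), normalize(records[j])
            if a == b or SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((records[i], records[j]))
    return pairs

# Hypothetical customer list with the kinds of variants that slip into real data
customers = ["Acme Corp", "ACME  Corp", "Acme Corporation", "Globex Inc"]
print(likely_duplicates(customers))
```

Note that this only *flags* likely duplicates; deciding which record survives the merge is usually a human call, which is part of why cleansing eats so much of a project’s time.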
But at least as important is identifying what is appropriate for the analysis you want to run. This ties into an earlier posting in which I posed the question “Do you know what problem you want to solve?”, and it remains the question of the day. Until you answer it, you won’t know which data to separate out and clean, and you’ll be staring at a mountain of information wondering where to start.