
[Image: Traffic on Brooklyn Bridge]

I recently presented at the Open Data Science Conference (ODSC) in Boston. During my presentation, I addressed how a well-designed data lake solution can help solve fundamental data integration problems, and I used a real-world example involving analysis of a New York City taxi company's route and fare data (which was stored in various formats depending on what was being tracked) to illustrate how data analysis could improve outcomes for cab drivers, cab companies and the local government.

Once the presentation was complete, I was ready for the most interesting part: the audience Q&A. But rather than lining up to ask questions, the audience was ready to move on to the next session. Apparently, they felt that learning about big data integration (accumulating, cleaning, integrating, linking and analytics) isn't very exciting. I found this curious, because for me the real value of sessions like this one at ODSC comes when the discussion moves beyond the technology itself and focuses on how the technology is applied. The question any business interested in big data needs to ask itself isn't should we use big data to improve our business (of course you should!), but do we know how to use our data to improve our business?

To illustrate, let me describe a situation I've seen play out time and time again in the big data market. In order to increase efficiencies, lower capex, and/or improve their products and services, a company decides to take the plunge into big data. They hire the best data scientists they can to kickstart their big data analytics program, focusing on candidates with advanced degrees in mathematics or computer science and a deep understanding of the technologies and players in the big data ecosystem. They then start an exhaustive review of competing big data platforms to determine which one best suits their use case. After months of work and hundreds of thousands of dollars, the new data analytics team launches the analytics platform, analyzes terabytes of new and legacy data, and shares the results with management, only to discover all of their time, money and effort was wasted. The data analysis yields no new insights, or worse, gibberish.

Why? In my experience, the majority of these failures aren't due to a data scientist's lack of technical ability or a shortcoming in the analytics platform. Rather, they occur because management asked their big data team to solve a problem without the requisite domain experience and without a data integration strategy. Without domain experience to guide its analysis, even the most powerful data analytics platform can't tell a business anything of value, because it won't be drawing upon the right dataset. After all, you wouldn't give a teen with a learner's permit the keys to a Ferrari and expect them to deliver lap times in league with those of a seasoned Formula 1 professional, would you?

To companies interested in developing a big data analytics team, I would encourage the search committee to ask themselves the following questions before they hire any candidate for data scientist.

  1. Do they have experience with our industry/target market? Even if that experience isn't IT-related?
  2. If a candidate doesn’t have the requisite domain experience, can they be mentored or paired with another team member to shorten the learning curve?
  3. Do we have a clear understanding of what our current data set can and can't tell us? If there are gaps in our existing data set, how will we address them?
  4. What is the company's data lake strategy? How will we optimize the data scientist's time by providing all the data in a single place?
  5. What are the candidate’s data processing skills? Do they have experience in techniques to profile, clean and normalize data?
  6. Does the candidate have experience with machine learning techniques? Are these techniques relevant to the problem domain?
  7. Most importantly, can the team actually use the analytics platform? Domain experts cannot be expected to learn complex programming languages like Java, C# or Scala. Instead, the experts should be able to use simple languages like SQL (or ECL) so they can start learning from the data quickly. The platform should also integrate data profiling and cleaning tools to help domain experts sort the data.
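
To make points 5 and 7 concrete, here is a minimal sketch of the workflow they describe: normalizing inconsistently formatted records, then querying the cleaned data with plain SQL. The driver IDs and fare values are hypothetical, loosely inspired by the taxi fare example above; this uses Python's built-in sqlite3 module purely for illustration, not any particular analytics platform.

```python
import sqlite3

# Hypothetical raw taxi-fare records as they might arrive from different
# source systems: the same field stored in inconsistent formats.
raw_records = [
    {"driver": "A101", "fare": "$12.50"},
    {"driver": "A101", "fare": " 8.75 "},
    {"driver": "B202", "fare": "15"},
    {"driver": "B202", "fare": "$9.25"},
]

def normalize_fare(value):
    """Profile/clean step: strip whitespace and currency symbols,
    coerce every variant to a float."""
    return float(value.strip().lstrip("$"))

cleaned = [(r["driver"], normalize_fare(r["fare"])) for r in raw_records]

# With the data cleaned and landed in one place, a domain expert needs
# only simple SQL to start asking questions of it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fares (driver TEXT, amount REAL)")
conn.executemany("INSERT INTO fares VALUES (?, ?)", cleaned)

totals = conn.execute(
    "SELECT driver, ROUND(SUM(amount), 2) AS total FROM fares "
    "GROUP BY driver ORDER BY driver"
).fetchall()
print(totals)  # [('A101', 21.25), ('B202', 24.25)]
```

The point isn't the tooling; it's that once the normalization is handled, the question "what did each driver earn?" takes one line of SQL that a non-programmer domain expert can write and modify.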