Data science is a method for transforming business data into assets that help organizations improve revenue, reduce costs, seize business opportunities, improve customer experience, and more.
What is data science?
Data science is a method for gleaning insights from structured and unstructured data using approaches ranging from statistical analysis to machine learning. For most organizations, data science is employed to transform data into value that might come in the form of improved revenue, reduced costs, business agility, improved customer experience, the development of new products, and the like.
“The amount of data you can grab, if you want, is immense, but if you’re not doing anything with it, turning it into something interesting, what good is it? Data science is about giving that data a purpose,” says Adam Hunt, chief data scientist at RiskIQ.
Data science vs. analytics
While closely related, data analytics is often viewed as a component of data science, used to understand what an organization’s data looks like. Data science takes the output of analytics to solve problems.
“Data science is coming to conclusions that drive your data forward,” Hunt says. “Understanding what your data looks like is analytics, but there’s no outcome beyond the data itself. If you’re not solving a problem with data, if you’re just doing an investigation, that’s just analysis. If you’re actually going to use the outcome to explain something, you’re going from analysis to science. Data science has more to do with the actual problem-solving than looking at, examining, and plotting [data].”
Data science vs. big data
Data science and big data are often viewed as connected concepts, but data scientists don’t just work with big data. Data science can be used to extract value from data of all sizes, whether structured, unstructured, or semi-structured.
Big data is useful to data science teams in many cases, because the more data you have, the more parameters you can include in a given model.
“With big data, you’re not necessarily bound to the dimensionality constraints of small data,” Hunt says. “Big data does help in certain aspects, but more isn’t always better. If you take the stock market and try to fit it to a line, it’s not going to work. But maybe, if you only look at it for a day or two, you can set it to a line.”
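Hunt's point about fitting a line can be illustrated with a minimal sketch. The series below is synthetic and invented for illustration (a slow oscillation, not market data), and `fit_line` is a hypothetical helper. It fits an ordinary least-squares line to the whole series and to a short window, then compares how much of the variation each line explains.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Synthetic, made-up "price" series (an assumption, NOT real market data).
days = list(range(250))
prices = [100 + 10 * math.sin(t / 15) for t in days]

_, _, r2_full = fit_line(days, prices)            # line through everything
_, _, r2_short = fit_line(days[:5], prices[:5])   # line through a short window

print(f"R^2 over all 250 points:     {r2_full:.3f}")
print(f"R^2 over the first 5 points: {r2_short:.3f}")
```

On this toy series the short window is nearly straight, so the line fits it closely, while the same kind of model explains very little of the full series. Real price data is noisier, so treat this only as an illustration of the idea.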
The business value of data science
The business value of data science depends on the organization it’s serving. Data science could help an organization build tools to predict hardware failures, allowing the organization to perform maintenance and prevent unplanned downtime. It could be used to predict what to put on supermarket shelves, or how popular a product will be based on its attributes.
“The biggest value a data science team can have is when they are embedded with business teams. Almost by definition, a novelty-seeking person, someone who really innovates, is going to find value or leakage of value that is not what people otherwise expected,” says Ted Dunning, chief application architect at MapR Technologies. “Often they’ll surprise the people in the business. The value wasn’t where people thought it was at first.”
Organization of data science teams
Data science is generally a team discipline. Data scientists are the forward-looking core of most data science teams, but moving from data to analysis and then transforming that analysis into production value requires a range of skills and roles. For example, data analysts should be on board to investigate the data before presenting it to the team and to maintain data models. Data engineers are necessary to build data pipelines to enrich data sets and make the data available to the rest of the company.
Mark Stange-Tregear, vice president of analytics at eBates, says it’s essential to think in terms of teams, rather than seeking a “unicorn,” an individual who combines non-linear thinking with advanced mathematics and statistics knowledge and the ability to code.
“Data engineering I don’t think of as a key data scientist trait,” Stange-Tregear explains. “I want someone that actually adds something else. If I can have someone build a model, be able to evaluate the statistics, and communicate the benefits of that model to the business, then I can hire data engineers that are sophisticated enough to take that model and implement it.”
The embedded approach to data science
Rather than isolate data science teams, some organizations opt to commingle data scientists with other functions. For example, MapR’s Dunning recommends organizations follow a DataOps approach to data science, by embedding data scientists in DevOps teams with business line responsibilities. These DataOps teams tend to be cross-functional — cutting across “skill guilds” like operations, software engineering, architecture and planning, and product management — and can orchestrate data, tools, code, and environments from beginning to end. DataOps teams tend to view analytic pipelines as analogous to manufacturing lines.
“It’s not a data science team’s job to do data science in some abstract sense,” Dunning says. “You want to get value out of that part of the business using data. An isolated data science team might want to deploy the most sophisticated model. The embedded data scientist is going to look for cheap wins that are maintainable. They’re mercenary, pragmatic, about solutions they pick.”
That said, data scientists aren’t necessarily permanently embedded in DataOps teams.
“Typically, there’s a data scientist embedded in the team for a time,” Dunning says. “Their capabilities and sensibilities begin to rub off. Someone on the team then takes on the role of data engineer and kind of a low-budget data scientist. The actual data scientist embedded in the team then moves along. It’s a fluid situation.”
Data science goals and deliverables
The goal of data science is to construct the means for extracting business-focused insights from data. This requires an understanding of how value and information flow in a business, and the ability to use that understanding to identify business opportunities. While that may involve one-off projects, more typically data science teams seek to identify key data assets that can be turned into data pipelines that feed maintainable tools and solutions. Examples include credit card fraud monitoring solutions used by banks, or tools used to optimize the placement of wind turbines in wind farms.
Interim presentations that communicate what the team is working on are also important deliverables.
“Making sure they’re communicating out results to the rest of the company is incredibly important,” RiskIQ’s Hunt says. “When a data science team goes dark for too long, it starts to get in a little trouble. Product managers take work for granted unless we’re talking about it all the time, selling it internally.”
Data science processes and methodologies
Production engineering teams work on sprint cycles, with projected timelines. That’s often difficult for data science teams to do, Hunt says, because a lot of time upfront can be spent just determining whether a project is feasible.
“A lot of times, the first week, or even first month, is research — collecting the data, cleaning it,” Hunt says. “Can we even answer the question? Can we do it efficiently? We spend a ton of time doing design and investigation, much more than a standard engineering team would perform.”
For Hunt, data science should follow the scientific method, though he notes that it’s not always the case, or even feasible.
“You’re trying to extract some insight out of data. In order to do that repeatedly and confidently, and to make sure you’re not just blowing smoke, you have to use the scientific method to accurately prove your hypothesis,” Hunt says. “But I don’t think many data scientists actually use any science whatsoever.”
Real science takes time, Hunt says. You spend a little bit of time confirming your hypothesis and then a lot of time trying to disprove yourself.
“With data science, you’re almost always in a for-profit company that doesn’t want to take the time to dive deeply enough into the data to validate these hypotheses,” Hunt says. “A lot of the questions we’re trying to answer are short-lived. In security, for instance, we’re trying to find the threat actor tomorrow, not next year — tomorrow, before he can release his threat to the wild.”
As a result, data science can often mean going with the “good enough” answer rather than the best answer, Hunt says. The danger, though, is that results can fall victim to confirmation bias or overfitting.
“If it’s not actually science, meaning you’re using scientific method to confirm a hypothesis, then what you’re doing is just throwing data at some algorithms to confirm your own assumptions,” Hunt says.
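One common guard against the overfitting risk described above is to hold back data the model never sees and try to disprove the model on it. The sketch below is a minimal illustration under stated assumptions: the data is synthetic (a straight line plus random noise), and `fit_memorizer` and `fit_linear` are hypothetical helpers, not any particular library's API. It compares a model that memorizes its training points with a simple line, on both training and held-out data.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption for illustration): y = 2x plus Gaussian noise.
points = [(x, 2 * x + rng.gauss(0, 5)) for x in range(60)]
train = points[0::2]    # even x values, used for fitting
holdout = points[1::2]  # odd x values, never seen during fitting

def mse(predict, data):
    """Mean squared error of a prediction function over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

def fit_memorizer(data):
    """Overfit on purpose: answer with the y of the nearest training x."""
    def predict(x):
        nearest = min(data, key=lambda p: abs(p[0] - x))
        return nearest[1]
    return predict

def fit_linear(data):
    """Ordinary least-squares line through the training points."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

memorizer = fit_memorizer(train)
line = fit_linear(train)

memorizer_train_mse = mse(memorizer, train)
memorizer_holdout_mse = mse(memorizer, holdout)
line_train_mse = mse(line, train)
line_holdout_mse = mse(line, holdout)

print(f"memorizer: train={memorizer_train_mse:.2f} holdout={memorizer_holdout_mse:.2f}")
print(f"line:      train={line_train_mse:.2f} holdout={line_holdout_mse:.2f}")
```

The memorizer reproduces its training points exactly, so its training error says nothing about how it will behave on new data. Only the held-out error can expose that gap, which is the kind of attempt to disprove yourself that Hunt describes.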
Data science tools
Data science teams make use of a wide range of tools, including SQL, Python, R, Java, and a cornucopia of open source projects like Hive, Oozie, and TensorFlow. These tools are used for a variety of data-related tasks, ranging from extracting and cleaning data to subjecting data to algorithmic analysis via statistical methods or machine learning.
“The first tool a data scientist needs is eyeballs and fingers,” MapR’s Dunning says. “It’s very, very common that the simplest things provide value, especially when people are starting. Look critically at very simple aspects of the data. Look for hints about how things work.”
Tools will help data science teams extend those eyeballs and fingers.
“You need good visualization tools. Programming tools — Python is an odds-on favorite at this point. You need the tools that will actually build interesting models. You can’t survive with just one,” Dunning says.
When MapR surveyed its customer data teams, Dunning says, the smallest number of modeling tools used by a team was five, and that didn’t even get into visualization tools.
“Things are becoming more polyglot because people are more suspicious. Will this other modeling technique produce a better model?” Dunning says.