Types of Data
In a world of digital transformation and cloud computing that drives our always-on, über-connected lifestyles, it is surely useful to understand the what, when, where and why of data on our journey.
1 – Big data
A perennial favorite, big data is commonly defined as data in volumes too large to fit practically into a standard (relational) database for analysis and processing, the result of the huge amounts of information created by human- and machine-generated processes.
“While definitions of ‘big data’ may differ slightly, at the root of each are very large, diverse data sets that include structured, semi-structured and unstructured data, from different sources and in different volumes, from terabytes to zettabytes. It’s about data sets so large and diverse that it’s difficult, if not impossible, for traditional relational databases to capture, manage, and process them with low-latency,” said Rob Thomas, general manager for IBM Analytics.
Thomas suggests that big data is a big deal because it is the fuel that drives technologies such as machine learning, which form the building blocks of artificial intelligence (AI). He says that by digging into (and analyzing) big data, people are able to discover patterns and better understand why things happened. They can then use AI to predict how things may happen in the future and prescribe strategic directions based on these insights.
2 – Structured, unstructured, semi-structured data
All data has structure of some sort. Delineating between structured and unstructured data comes down to whether the data has a pre-defined data model and whether it’s organized in a pre-defined way.
Mat Keep, senior director of products and solutions at MongoDB, explains that in the past data structures were relatively simple and often known ahead of data model design, so data was typically stored in the tabular row-and-column format of relational databases.
“However, the advance of modern web, mobile, social, AI, and IoT apps, coupled with modern object-oriented programming, break that paradigm. The data describing an entity (i.e. a customer, product, connected asset) is managed in code as complete objects, containing deeply nested elements. The structure of those objects can vary (polymorphism) – i.e. some customers have a social media profile that is tracked, and some don’t. And, with agile development methodologies, data structures also change rapidly as new application features are built,” said Keep.
As a result of this polymorphism, many software developers are looking to more flexible alternatives to relational databases that can accommodate data of any structure.
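To make the idea concrete, here is a minimal sketch in Python of the kind of polymorphic, nested data Keep describes. The documents and field names are purely illustrative, not drawn from any particular schema or product.

```python
import json

# Two "customer" documents with different shapes (polymorphism):
# one has a nested social-media profile, the other does not.
customers = [
    {
        "name": "Ada",
        "orders": [{"sku": "A-100", "qty": 2}],
        "social": {"twitter": "@ada", "followers": 1200},
    },
    {
        "name": "Grace",
        "orders": [{"sku": "B-200", "qty": 1}, {"sku": "C-300", "qty": 5}],
        # no "social" profile tracked for this customer
    },
]

# Application code treats each document as a complete object and
# tolerates missing or extra nested fields.
for c in customers:
    followers = c.get("social", {}).get("followers", 0)
    total_items = sum(o["qty"] for o in c["orders"])
    print(f'{c["name"]}: {total_items} items ordered, {followers} followers')

# Documents serialize directly to JSON, the typical storage format for
# document databases such as MongoDB.
print(json.dumps(customers[0], indent=2))
```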
3 – Time-stamped data
Time-stamped data is data with an inherent time ordering, defining the sequence in which each data point was either captured (event time) or collected (processed time).
“This type of data is typically used when collecting behavioral data (for example, user actions on a website) and thus is a true representation of actions over time. Having a dataset such as this is invaluable to data scientists who are working on systems that are tasked with predicting or estimating next best action style models, or performing journey analysis as it is possible to replay a user’s steps through a system, learn from changes over time and respond,” said Alex Olivier, product manager at marketing personalization software platform company Qubit.
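As a rough illustration of the event-time versus processed-time distinction, the following Python sketch replays a hypothetical user journey in the order the actions actually occurred; the events and timestamps are invented for the example.

```python
from datetime import datetime, timezone

# Hypothetical clickstream events: "event_time" is when the user acted,
# "processed_time" is when the collector received the event. Events can
# arrive out of order, so replaying a journey means sorting by event time.
events = [
    {"user": "u1", "action": "add_to_cart",
     "event_time": datetime(2024, 5, 1, 10, 0, 5, tzinfo=timezone.utc),
     "processed_time": datetime(2024, 5, 1, 10, 0, 9, tzinfo=timezone.utc)},
    {"user": "u1", "action": "view_product",
     "event_time": datetime(2024, 5, 1, 10, 0, 1, tzinfo=timezone.utc),
     "processed_time": datetime(2024, 5, 1, 10, 0, 10, tzinfo=timezone.utc)},
    {"user": "u1", "action": "checkout",
     "event_time": datetime(2024, 5, 1, 10, 0, 30, tzinfo=timezone.utc),
     "processed_time": datetime(2024, 5, 1, 10, 0, 31, tzinfo=timezone.utc)},
]

# Replay the user's steps in the order they actually happened.
for e in sorted(events, key=lambda e: e["event_time"]):
    lag = (e["processed_time"] - e["event_time"]).total_seconds()
    print(f'{e["event_time"].isoformat()}  {e["action"]:<12}  collection lag: {lag:.0f}s')
```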
4 – Machine data
Simply put, machine data is the digital exhaust created by the systems, technologies and infrastructure powering modern businesses.
Matt Davies, head of EMEA marketing at Splunk, asks us to picture a typical day at work: driving to the office in a connected car, logging on to a computer, making phone calls, responding to emails, accessing applications. Davies explains that all of this activity creates a wealth of machine data in an array of unpredictable formats, and that it is often ignored.
“Machine data includes data from areas as varied as application programming interfaces (APIs), security endpoints, message queues, change events, cloud applications, call detail records and sensor data from industrial systems,” said Davies. “Yet machine data is valuable because it contains a definitive, real time record of all the activity and behavior of customers, users, transactions, applications, servers, networks and mobile devices.”
If made accessible and usable, machine data can arguably help organizations troubleshoot problems, identify threats and use machine learning to predict future issues.
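As a hedged illustration of how raw machine data can be made usable, the short Python sketch below parses a couple of invented log lines into structured fields; the log format, field names and values are assumptions for the example, not the output of any particular product.

```python
import re

# Two raw, loosely formatted log lines of the kind a web server or
# application might emit; the format here is invented for illustration.
raw_lines = [
    '2024-05-01T10:00:01Z host=web-01 level=ERROR msg="payment timeout" user=4711',
    '2024-05-01T10:00:02Z host=web-02 level=INFO msg="login ok" user=9022',
]

# Pull the leading timestamp and the key=value pairs into a dict so the
# "digital exhaust" becomes something that can be searched and aggregated.
pair_pattern = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse(line: str) -> dict:
    timestamp, _, rest = line.partition(" ")
    fields = {key: value.strip('"') for key, value in pair_pattern.findall(rest)}
    fields["timestamp"] = timestamp
    return fields

records = [parse(line) for line in raw_lines]
errors = [r for r in records if r.get("level") == "ERROR"]
print(f"{len(errors)} error event(s) out of {len(records)}")
print(errors[0])
```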
5 – Spatiotemporal data
Spatiotemporal data describes both location and time for the same event — and it can show us how phenomena in a physical location change over time.
“Spatial data is the ‘spatio’ in spatiotemporal. It can describe point locations or more complex lines such as vehicle trajectories, or polygons (plane figures) that make up geographic objects like countries, roads, lakes or building footprints,” explained Todd Mostak, CEO of MapD.
Temporal data carries date and time information in a time stamp. Valid time is the period during which a fact is true in the real world; transaction time is the period during which that fact was recorded in the database.
“Examples of how analysts can visualize and interact with spatiotemporal data include: tracking moving vehicles, describing the change in populations over time, or identifying anomalies in a telecommunications network. Decision-makers can also run backend database calculations to find distances between objects or summary statistics on objects contained within specified locations,” said MapD’s Mostak.
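A minimal sketch of working with spatiotemporal data, assuming a toy vehicle trajectory of invented GPS pings: it walks the points in time order and computes the great-circle distance and average speed for each leg.

```python
from math import radians, sin, cos, asin, sqrt
from datetime import datetime

# Illustrative GPS pings for one vehicle: (timestamp, latitude, longitude).
pings = [
    (datetime(2024, 5, 1, 9, 0), 51.5074, -0.1278),   # London
    (datetime(2024, 5, 1, 9, 30), 51.7520, -1.2577),  # Oxford
    (datetime(2024, 5, 1, 10, 0), 52.2053, 0.1218),   # Cambridge
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Walk the trajectory in time order and report distance and speed per leg.
for (t1, la1, lo1), (t2, la2, lo2) in zip(pings, pings[1:]):
    km = haversine_km(la1, lo1, la2, lo2)
    hours = (t2 - t1).total_seconds() / 3600
    print(f"{t1:%H:%M} -> {t2:%H:%M}: {km:.1f} km, {km / hours:.0f} km/h")
```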
6 – Open data
Open data is data that anyone is free to use (for example, to apply analytics to) and to republish, without restrictions from copyright, patents or other mechanisms of control. The Open Data Institute states that open data is only useful if it's shared in ways that people can actually understand: in a standardized format and easily traced back to where it came from.
“Wouldn’t it be interesting if we could make some private data [shapes, extrapolated trends, aggregate values and analytics] available to the world without giving up the source and owner identification of that data? Some technologies are emerging, like multi-party computation and differential privacy that can help us do this,” said Mike Bursell, chief security architect at Red Hat.
Bursell explains that these are still largely academic techniques, but says that over the next ten years people will think about what we mean by open data in different ways. The open source world understands some of those questions and can lead the pack. He notes that this can be difficult for organizations that have built their business around keeping secrets; they now have to look at how opening that data up can create opportunities for wealth creation and innovation.
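As a rough sketch of the differential-privacy idea Bursell mentions, the example below publishes a noisy aggregate count rather than the raw value; the records, the epsilon value and the use of NumPy's Laplace sampler are illustrative choices, not a production mechanism.

```python
import numpy as np

# Publish an aggregate (here, a count) with calibrated Laplace noise instead
# of the raw value, so the shared figure reveals little about any single
# underlying record. Epsilon and the data are invented for the example.
rng = np.random.default_rng(42)

def noisy_count(true_count: int, epsilon: float = 0.5, sensitivity: int = 1) -> float:
    scale = sensitivity / epsilon          # Laplace scale for this privacy budget
    return true_count + rng.laplace(0.0, scale)

private_records = [{"age": a} for a in (34, 29, 41, 52, 38, 45)]
over_40 = sum(1 for r in private_records if r["age"] > 40)
print("true count:", over_40)
print("published (noisy) count:", round(noisy_count(over_40), 1))
```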
7 – Dark data
Dark data is digital information that is not being used and lies dormant in some form.
Analyst house Gartner Inc. describes dark data as, “Information assets that an organization collects, processes and stores in the course of its regular business activity, but generally fails to use for other purposes.”
8 – Real time data
One of the most explosive trends in analytics is the ability to stream and act on real time data. Some argue that the term is something of a misnomer: data can only travel as fast as the speed of communications, so even real time data lags slightly behind the actual passage of time in the real world. In practice, though, the term refers to near-instantaneous computing that happens about as fast as a human can perceive.
“Trends like edge computing and the impending rise of 5G are gaining their momentum based upon the opportunities thrown up by real time data. The power of immediacy with data is going to be the catalyst for realizing smart cities,” said Daniel Newman, principal analyst at Chicago-based Futurum Research.
Newman says that real time data can help with everything from deploying emergency resources in a road crash to helping traffic flow more smoothly during a citywide event. He says that real time data can also provide a better link between consumers and brands allowing the most relevant offers to be delivered at precise moments based upon location and preferences. “Real time data is a real powerhouse and its potential will be fully realized in the near term,” added Newman.
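The latency point above can be made concrete with a toy Python sketch: a simulated stream of events and a consumer that measures how far behind real time it is when it acts on each one. The timings are invented and the "stream" is just a generator.

```python
import time

# A toy "stream": events carry the time they were generated; the consumer
# measures how far behind real time it is when it handles each one. Any
# pipeline adds some lag, which is why "real time" in practice means
# "fast enough that a human (or a control loop) cannot tell".
def event_source(n=5):
    for i in range(n):
        yield {"id": i, "generated_at": time.monotonic()}
        time.sleep(0.01)              # events arrive every ~10 ms

for event in event_source():
    time.sleep(0.002)                 # pretend to do some work on the event
    lag_ms = (time.monotonic() - event["generated_at"]) * 1000
    print(f"event {event['id']} handled {lag_ms:.1f} ms after it occurred")
```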
9 – Genomics data
Bharath Gowda, vice president for product marketing at Databricks, points to genomics data as another area that needs specialist understanding. Genomics involves analyzing the DNA of patients to identify new drugs and improve care with personalized treatments.
He explains, “The data involved [in genomics] is huge – by 2020 genomic data is expected to be orders of magnitude greater than the data produced by Twitter and YouTube. The first genome took over a decade to assemble. Today, a patient’s genome can be sequenced in a couple of days. However, generating data is the easy part. Turning data into insight is the challenge. The tools used by researchers cannot handle the massive volumes of genomic data.”
What are the issues here? According to Gowda, data processing and downstream analytics are the new bottlenecks that stop us from getting more value out of genomic data. So what makes genomic data different?
“It requires significant data processing and needs to be blended with data from hundreds of thousands of patients to generate insights. Furthermore, you need to look at how you can unify analytics workflows across all teams – from the bioinformatics professional prepping data to the clinical specialist treating patients – in order to maximize its value,” said Gowda.
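As a deliberately tiny illustration of blending data across patients, the sketch below aggregates genotypes at a single variant site for a handful of invented patients to compute an allele frequency; real pipelines do this across millions of sites and far larger cohorts.

```python
from collections import Counter

# A toy cohort: each patient's genotype at one variant site, encoded as the
# count of alternate alleles (0, 1 or 2). Patient IDs and values are invented.
genotypes = {
    "patient_001": 0,
    "patient_002": 1,
    "patient_003": 2,
    "patient_004": 1,
    "patient_005": 0,
}

counts = Counter(genotypes.values())
alt_alleles = sum(genotypes.values())
total_alleles = 2 * len(genotypes)        # diploid: two alleles per patient

print("genotype distribution:", dict(counts))
print(f"alternate allele frequency: {alt_alleles / total_alleles:.2f}")
```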
10 – Operational data
Colin Fernandes is product marketing director for the EMEA region at Sumo Logic. Fernandes says that companies have big data: application logs and metrics, event data, and information from microservices applications and third parties.
The question is: how can they turn this data into business insights that decision makers and non-technical teams can use, in addition to data scientists and IT specialists?
“This is where operational analytics comes into play,” said Fernandes. “Analyzing operational data turns IT systems data into resources that employees can use in their roles. What’s important here is that we turn data from a specialist resource into assets that can be understood by everyone, from the CEO to line of business workers, whenever they have a decision to make.”
Fernandes points out that in practice this means looking at new applications and business goals together to reverse engineer what your operational data metrics should be. New customer-facing services can be developed on microservices, but how do we make sure we extract the right data from the start? By putting this 'operational data' mindset in place, organizations can arguably get the right information to the right people whenever they need it.
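As a hedged sketch of what turning IT systems data into a shared resource can look like in code, the example below derives an error rate and a rough p95 latency from a few invented request records for a hypothetical checkout service.

```python
import statistics

# Parsed operational records (e.g. from application logs or metrics) for a
# hypothetical checkout microservice; fields and values are illustrative.
requests = [
    {"service": "checkout", "status": 200, "latency_ms": 120},
    {"service": "checkout", "status": 200, "latency_ms": 95},
    {"service": "checkout", "status": 500, "latency_ms": 480},
    {"service": "checkout", "status": 200, "latency_ms": 140},
    {"service": "checkout", "status": 200, "latency_ms": 110},
]

# Turn raw IT-system data into figures a non-technical stakeholder can act on.
error_rate = sum(r["status"] >= 500 for r in requests) / len(requests)
latencies = sorted(r["latency_ms"] for r in requests)
p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]  # rough p95

print(f"checkout error rate: {error_rate:.0%}")
print(f"checkout p95 latency: {p95} ms (median {statistics.median(latencies):.0f} ms)")
```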
11 – High-dimensional data
High-dimensional data is a term being popularized in relation to facial recognition technologies. Because of the hugely complex set of contours on a human face, we need representations of data rich enough to support computations that describe all the nuances and individual variations across our facial physiognomies. Related to this is the concept of eigenfaces, the name given to a set of eigenvectors when they are used in computing to represent human faces for face recognition.
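A minimal eigenfaces-style sketch, using random numbers in place of real photographs, shows the mechanics: each face is a very high-dimensional pixel vector, and principal component analysis (via SVD) compresses it onto a handful of eigenvectors.

```python
import numpy as np

# Each row of X is one flattened grayscale "face" image (here 32x32 = 1,024
# dimensions); random data stands in for real photographs purely to show the
# mechanics of the technique.
rng = np.random.default_rng(0)
n_faces, img_dim = 100, 32 * 32
X = rng.random((n_faces, img_dim))

# Centre the data and take the top principal components via SVD. The rows of
# Vt are the eigenvectors of the pixel covariance structure -- the
# "eigenfaces" -- and projecting onto them compresses each face from
# 1,024 numbers down to k.
mean_face = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean_face, full_matrices=False)
k = 10
eigenfaces = Vt[:k]                      # shape (10, 1024)
codes = (X - mean_face) @ eigenfaces.T   # shape (100, 10)

# A new face can then be compared to known faces in this low-dimensional space.
print("original dimensionality:", img_dim)
print("compressed representation per face:", codes.shape[1])
```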
12 – Unverified outdated data
The previously quoted Mike Bursell of Red Hat also points to what he calls unverified outdated data. This is data that has been collected, but nobody has any idea whether it’s relevant, accurate or even of the right type. We can suggest that in business terms, if you’re trusting data that you haven’t verified, then you shouldn’t be trusting any decisions that are made on its basis. Bursell says that Garbage In, Garbage Out still holds… and without verification, data is just that: garbage.
“Arguably even worse than unverified data, which may at least have some validity and which you should at least know that you shouldn’t trust, is data which is out-of-date and used to be relevant. But much of the real-world evidence from which we derive our data changes, and if the data doesn’t change to reflect that, then it is positively dangerous to use it in many cases,” said Bursell.
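As an illustrative sketch of the verification and staleness checks implied here, the example below accepts a record only if it has the expected shape and is recent enough; the field names and the 30-day threshold are assumptions for the example.

```python
from datetime import datetime, timezone, timedelta

# A hypothetical sensor record; the fields and threshold are illustrative.
record = {"sensor_id": "s-42", "temperature_c": 21.4,
          "collected_at": datetime(2023, 1, 15, tzinfo=timezone.utc)}

def is_trustworthy(rec: dict, max_age: timedelta = timedelta(days=30)) -> bool:
    # Verification: does the record have the fields and types we expect?
    has_expected_shape = (
        isinstance(rec.get("sensor_id"), str)
        and isinstance(rec.get("temperature_c"), (int, float))
        and isinstance(rec.get("collected_at"), datetime)
    )
    if not has_expected_shape:
        return False
    # Staleness: is it recent enough to still reflect reality?
    return datetime.now(timezone.utc) - rec["collected_at"] <= max_age

print("safe to base a decision on this record?", is_trustworthy(record))
```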
13 – Translytic Data
An amalgam of ‘transact’ and ‘analyze’, translytic data is argued to enable on-demand real-time processing and reporting with new metrics not previously available at the point of action. This is the opinion of Mark Darbyshire, CTO for data and database management at SAP UK.
Darbyshire says that traditionally, analysis has been done on a copy of transactional data. But today, with the availability of in-memory computing, companies can perform ‘transaction window’ analytics. This he says supports tasks that increase business value like intelligent targeting, curated recommendations, alternative diagnosis and instant fraud detection as well as providing subtle but valuable business insights.
According to SAP’s Darbyshire, “Translytic data requires a simplified technology architecture and hybrid transactional analytic database systems, which are enabled by the in-memory technology. This also provides the added benefit of simplicity of architecture – one system to maintain with no data movement. Companies who transact in real time with instant insight into the relevant key metrics that matter while they transact experience increased operational efficiency as well as faster access and improved visibility into its real-time data.”
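As a toy illustration of the translytic pattern, rather than any particular vendor's technology, the sketch below writes a transaction and immediately runs an analytic query over the same live rows, using Python's built-in in-memory SQLite purely to stand in for a hybrid transactional/analytic system.

```python
import sqlite3

# Write a transaction and analyze the same live data in one system, with no
# copy to a separate warehouse. SQLite in :memory: mode is only a stand-in
# here; the table and values are invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, suspicious INTEGER)")

with conn:  # one transaction
    conn.execute("INSERT INTO orders VALUES ('acme', 120.0, 0)")
    conn.execute("INSERT INTO orders VALUES ('acme', 90.5, 0)")
    conn.execute("INSERT INTO orders VALUES ('bolt', 9999.0, 1)")

# Analytics at the point of action: per-customer totals and a fraud-style
# flag count, computed directly on the transactional rows.
for customer, total, flagged in conn.execute(
    "SELECT customer, SUM(amount), SUM(suspicious) FROM orders GROUP BY customer"
):
    print(f"{customer}: total {total:.2f}, suspicious orders: {flagged}")
```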
This list is by no means exhaustive; such is the nature of information technology and the proliferation of data.