This is a guest post by Michael G. Solomon, PhD, CISSP, PMP, CISM.
Data science promises the ability to extract valuable knowledge from data. But with the staggering amount of data available to us, figuring out what it all means is getting harder. Data scientists can analyze data about past behavior and use it to predict future actions. In some cases, they can even determine how to modify behavior to achieve a desired outcome.
Despite the attention data science gets, does it really deliver, or is it just another cool buzzword? Let’s look at what data science is, how it impacts today’s organizations, and how it is changing the way we use technology.
The Science of Big Data
Data science is a multidisciplinary field that draws on expertise from different areas, including statistics, technology, and business, to extract previously hidden knowledge from vast amounts of data. In short, the goal is to make sense of all that data we’re collecting.
As nearly every aspect of life becomes more reliant on data, the amount we’re collecting has grown so large that it is hard to truly comprehend. That scale is why we call it big data. Visual Capitalist reports that the sum of all stored digital data will reach 44 zettabytes by 2020. In bytes, that’s 44 with 21 zeros behind it! Put another way, a zettabyte is equal to 1 trillion gigabytes.
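A quick sanity check on that arithmetic, sketched in a few lines of Python (the unit definitions here use the decimal SI prefixes, where a zettabyte is 10^21 bytes):

```python
# SI decimal units: a gigabyte is 10**9 bytes, a zettabyte is 10**21 bytes.
gigabyte = 10**9
zettabyte = 10**21

# 44 zettabytes, written out in bytes: the digits 44 followed by 21 zeros.
total_bytes = 44 * zettabyte
print(total_bytes)                 # 44000000000000000000000

# One zettabyte expressed in gigabytes: 10**12, i.e. 1 trillion.
print(zettabyte // gigabyte)       # 1000000000000
```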
If all we do as individuals and organizations (and governments) is store data, we’re little more than digital hoarders. Storing that much data “just in case” we need it one day doesn’t make much sense. A better way to look at data is to understand the value it contains. That’s what data science can give us. It isn’t as simple as clicking a few buttons, but with a little work, that vast data repository can reveal many secrets.
As humans, we tend to be creatures of habit. Data science takes advantage of that: it examines large amounts of data to uncover trends and associations, then uses them to predict future outcomes. Knowing what customers, partners, or even competitors may do next can be a huge strategic advantage for any organization.
Making Use of the Explosion of Data
Data analysis has been around for a long time. From the very beginning of general computing, the owners of expensive computing equipment have wanted to know how the data those machines held could benefit them.
Several things happened that forced simple data analysis to mature into its own domain of data science. The launch of the World Wide Web’s first browser in the early 1990s started the shift. Individuals and organizations began to realize the value of expanded connectivity. It didn’t take long to realize that tracking online activities could be used to influence human behavior. Recommendation engines emerged and became a common feature of websites of all types.
At first, data was held in silos by each separate organization. Soon after, many realized that collecting data for later sale could be a lucrative online endeavor. That led to an explosion of ways to collect and process data. Today, everything you type and click on is recorded and likely stored in some organization’s databases. Your activity has real value to many organizations that want to sell you products or services.
Technology advances led to higher and more available internet bandwidth, along with more powerful computers and mobile devices. These advances, in turn, led to more and more applications that generate data to be stored. Traditional data analysis simply couldn’t keep up. The increasing velocity of data growth, combined with the expanding variety of data types, data sources, and business objectives, demanded a new approach. Data science emerged as the solution to this complex problem.
Data science grew from the combination of statistics to analyze data, technology to acquire and cleanse data so it can be analyzed, and business insight to understand how the data applies to organizational objectives. Data science is coming into its own, and many programs now exist to educate the next generation of data scientists.
Testing a Moving Target
Like it or not, data science is likely to be a part of most organizations in some form. Organizations that ignore the valuable secrets hidden in their data will find themselves less prepared than data-savvy competitors. Understanding the value not only of data, but also the process to uncover those secrets, is crucial to strategic success.
However, data science projects, models, and tools differ from other technical endeavors in one important way: they are all based on dynamically changing data. Most other technical projects produce software or hardware outcomes that are relatively stable. Testing validates that the product works as required and, barring unforeseen faults, will continue to operate as required.
Data science models are different. Testing is necessary throughout the active project, but it must also continue after operationalization. Just because analysis models produce valid output today doesn’t mean that they’ll continue to provide output that you can trust. Substantial changes to the underlying data change how well models can explore, describe and predict outcomes. When it comes to data science initiatives, a tester’s work is never really done.
That means the traditional testing role must take on a new dimension. Testing data analytics for big data requires ongoing validation to ensure the models are still appropriate as the data changes. The frequency of validation depends on the criticality of the model output and the velocity of changing data.
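One lightweight way to implement that ongoing validation is to re-score the deployed model on fresh data at a set cadence and flag it when its error drifts past an agreed tolerance. This is a minimal sketch; the function name, error values, and 10% threshold are all illustrative assumptions, not from the article:

```python
def needs_revalidation(baseline_error, current_error, tolerance=0.10):
    """Flag a deployed model when its error on fresh data drifts
    more than `tolerance` (10% here) above the accepted baseline."""
    return current_error > baseline_error * (1 + tolerance)

# Hypothetical monitoring values: error measured at deployment vs. today.
print(needs_revalidation(0.05, 0.052))  # False: drift within tolerance
print(needs_revalidation(0.05, 0.08))   # True: 60% worse, revalidate the model
```

How often such a check should run depends, as noted above, on how critical the model’s output is and how quickly the underlying data changes.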
The rapidly expanding universe of data presents a great opportunity for organizations. They can seize it only by building data science project teams prepared to handle all aspects of complex projects, including post-operationalization validation.
Data science isn’t magic, it isn’t just the latest trend, and it isn’t yet the status quo. It is the key to ongoing organizational success.
Article written by Michael Solomon, PhD, CISSP, PMP, CISM, Professor of Information Systems Security and Information Technology at University of the Cumberlands.