This is a guest post by Michael G. Solomon, PhD CISSP PMP CISM.
“Big data” is a popular term that finds its way into many of today’s conversations. Although it has come to mean any large collection of data, the generally agreed-upon definition is a bit broader.
According to the Gartner IT Glossary, “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” The trick is unlocking any secrets that these information assets contain.
Most traditional software testing focuses on validating that the application functionality is correct. In many cases, the data itself isn’t central to the testing approaches; the focus is on the software.
In contrast, testing big data applications is equally focused on evaluating and validating the data, the performance of the software, and the quality of the output that the software produces. Big data applications involve not only storing vast amounts of data but also analyzing that data as efficiently as possible.
Let’s explore some approaches to validating the functionality, quality, and performance of your big data application.
Identifying Testing Challenges
Big data analytics projects are all about extracting actionable information from vast amounts of data drawn from many sources. That may sound simple, but it really combines three interrelated activities: collecting good data, analyzing that data, and creating meaningful output. Any of these three activities, if done poorly or improperly, can deliver results that don’t drive good business decisions.
The question is, how can you ensure that your big data projects produce valuable output? While you can never guarantee 100% accuracy, you can promote the highest accuracy possible through comprehensive and frequent testing.
Since big data analytics activities are dependent on large sets of data, you must test more than just the analytics models. That makes big data testing both more difficult and more important than testing traditional applications.
You must make testing activities a fundamental part of the overall design of any big data project, and testing must encompass all activities, from data acquisition through final results output. Comprehensive and aggressive testing throughout the project will help ensure the highest accuracy and provide better information on which business decisions should be made.
Developing Strategies for Testing
Big data testing is really a series of activity-based testing strategies. Every big data project collects data, analyzes data, and then produces output that translates analysis findings into actionable information. Testing is important for each activity, and you should tailor the testing approach to meet the needs of each activity.
Evaluating Input Data
Validating input data is an often-underestimated activity, but it is required to ensure that the data you collect is the data you need and that it comes from the right sources. And it’s not trivial work; this step frequently consumes much of the project’s schedule. One of the nuances of big data projects is that data input sources can be widely varied in format and origin. Your data may come from legacy databases, flat files, web services, autonomous sensors or even media files.
This evaluation of input data is sometimes called pre-Hadoop testing, because the Hadoop framework is commonly used to carry out much of the analysis in the next step.
The analysis phase depends on the quality of input data; the adage “Garbage in, garbage out” applies here. The first crucial activity in big data application testing is to validate that the input data is correct and aligned with the application’s requirements. And this isn’t a one-off activity: it must be repeated continually, since big data tends to change rapidly as time passes.
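As a rough illustration of what input validation can look like, here is a minimal Python sketch that checks each incoming record against a schema and sets aside anything that fails. The schema, the field names (`sensor_id`, `temperature_c`, `source`), and the allowed ranges are all hypothetical; a real project would derive them from its own requirements.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Hypothetical schema for illustration: field name -> (expected type, range check).
Schema = Dict[str, Tuple[type, Callable[[Any], bool]]]

SENSOR_SCHEMA: Schema = {
    "sensor_id": (str, lambda v: len(v) > 0),
    "temperature_c": (float, lambda v: -90.0 <= v <= 60.0),
    "source": (str, lambda v: v in {"legacy_db", "flat_file", "web_service"}),
}


def validate_record(record: Dict[str, Any], schema: Schema) -> List[str]:
    """Return the problems found in one record (an empty list means valid)."""
    problems = []
    for field, (expected_type, check) in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
        elif not check(record[field]):
            problems.append(f"out-of-range value for {field}")
    return problems


def split_valid(records: Iterable[Dict[str, Any]], schema: Schema):
    """Separate records into those that pass and those rejected, with reasons."""
    valid, rejected = [], []
    for record in records:
        problems = validate_record(record, schema)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected
```

Keeping the rejected records together with their reasons, rather than silently dropping them, makes it easier to notice when a data source starts to drift.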
Testing Analytics Models
Once you trust your input data, the next step is to carry out the analysis activities. After you build the various analytics models you’ll use in your application, it is important to test the accuracy of each model, initially and periodically afterward. Because big data changes rapidly, models that provide results with high accuracy at one point in time may degrade as input data changes.
A good model is generally one that exhibits low sensitivity to changes in its input data; that is, it should keep working well as the data it receives shifts. However, real-life data is rarely well-behaved and tends to resist stability under static models. That’s just a fancy way of saying that real data can’t easily be represented with a simple model. For that reason, it is important to continuously evaluate how well a model continues to work as input data changes.
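One simple way to watch for this kind of degradation is to score the model on a fresh batch of labeled data and compare the result against the accuracy measured when the model was first accepted. The sketch below assumes a classification-style model and a hypothetical tolerance of five percentage points; both the metric and the threshold are illustrative choices, not a recommendation.

```python
from typing import Sequence


def accuracy(predictions: Sequence, actuals: Sequence) -> float:
    """Fraction of predictions that match the known outcomes."""
    if len(predictions) != len(actuals) or len(actuals) == 0:
        raise ValueError("predictions and actuals must be non-empty and equal in length")
    correct = sum(1 for p, a in zip(predictions, actuals) if p == a)
    return correct / len(actuals)


def has_degraded(baseline_accuracy: float,
                 current_accuracy: float,
                 tolerance: float = 0.05) -> bool:
    """True when accuracy has dropped by more than the (assumed) tolerance."""
    return (baseline_accuracy - current_accuracy) > tolerance
```

Running a check like this on a schedule turns "periodically afterward" into something concrete: a model that trips the threshold becomes a candidate for retraining or redesign.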
Another aspect of verifying analytics through testing is evaluating how well your software performs. A great analysis model that takes three weeks to execute probably isn’t a good model for an operational environment. Accuracy is important, but providing accurate results in an optimal time frame is even better. An important aspect of big data application testing is aligning model runtimes with project goals. If execution times exceed goals, additional software development and optimization may be necessary.
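To make the runtime goal testable, you can time each analysis job and compare the elapsed time with an agreed budget. The following Python sketch assumes the job can be wrapped in a plain function; `run_with_budget` and the budget value are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_budget(job: Callable[[], T], budget_seconds: float) -> Tuple[T, float, bool]:
    """Run a job and report its result, elapsed seconds, and whether it met the budget."""
    start = time.perf_counter()
    result = job()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= budget_seconds
```

A test suite could then assert on the third value, so that a model whose runtime creeps past its target fails a test instead of quietly slowing down operations.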
Validating Output
The final stage of any big data application, and therefore the final stage to test, is the creation of output data files. Output files are used to present results to stakeholders and to create visualizations.
Testing at this stage is necessary to verify the integrity of the output data and validate that the results align with the models’ output. In effect, validating output is the process of testing the presentation and visualization procedures. This testing activity ensures that model results map consistently to presented outcomes.
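One way to check that model results map consistently to what is presented is to read the exported file back and compare it with the values the model produced. The sketch below assumes results are a simple mapping of segment names to scores and that the output is a CSV file; the column names, the four-decimal formatting, and the tolerance are hypothetical.

```python
import csv
import io
from typing import Dict, List


def export_results(results: Dict[str, float]) -> str:
    """Write model results as CSV text, as a reporting step might do."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["segment", "score"])
    for segment, score in sorted(results.items()):
        writer.writerow([segment, f"{score:.4f}"])
    return buffer.getvalue()


def find_mismatches(results: Dict[str, float],
                    exported_csv: str,
                    tolerance: float = 1e-4) -> List[str]:
    """Compare the exported CSV with the model results and list any differences."""
    reader = csv.DictReader(io.StringIO(exported_csv))
    exported = {row["segment"]: float(row["score"]) for row in reader}
    mismatches = []
    for segment, score in results.items():
        if segment not in exported:
            mismatches.append(f"missing in output: {segment}")
        elif abs(exported[segment] - score) > tolerance:
            mismatches.append(f"value differs for {segment}")
    for segment in exported:
        if segment not in results:
            mismatches.append(f"unexpected in output: {segment}")
    return mismatches
```

The tolerance allows for rounding introduced by formatting, while still catching rows that were dropped, added, or altered between the model and the report.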
Pulling It All Together
Big data application testing should not be an afterthought or a one-time activity. To effectively evaluate how well your big data application meets its design goals, testing must be part of the application’s design and operation. The key factors of a successful testing regimen are to test your data, test your models, test your results, and then do it all again.
Input data changes may affect the quality of your model’s output. Without aggressive testing, you may miss critical business indicators. Your models could degrade to the point that they are simply no longer cost-effective. Proper full-cycle analytics testing can have a direct impact on whether your big data application is a business liability or a profitable asset.
Article written by Michael Solomon PhD CISSP PMP CISM, Professor of Information Systems Security and Information Technology at University of the Cumberlands.