Book Review: Exploring Data with RapidMiner

Synopsis: Data preparation and visualization within a popular software tool
Difficulty: Beginner


Most books about a particular software tool focus on that tool's core capabilities and give only cursory (if any) coverage of its housekeeping or preparatory functions. For example, most books I’ve read that discuss RapidMiner (and there aren’t many) spend 90% of their time on Operators and how to build a Process. That’s great, but there are steps that can and should be taken before ever getting to that point.
That’s where this book shines. You won’t find much here about linear regression or k-means or support vector machines. Rather, you’ll find plenty about inspecting data, cleaning it, and visualizing it.
In a recent tutorial on using RapidMiner, I deliberately took the reader down a path where we encountered common errors in using the k-means Operator. We needed to transform the data before the Operator could do its job. Chisholm spends two entire chapters on that sort of thing in Exploring Data with RapidMiner (Chapter 4, “Parsing and Converting Attributes” and Chapter 7, “Transforming Data”). If you’re working with an imperfect dataset, which is the case most of the time in the real world, this information is like gold.
You also get coverage of visualizing data, which is a good habit to get into when first importing a dataset. Sometimes patterns jump out at you and suggest which direction to take the analysis. RapidMiner provides the basic tools for this, like histograms and scatter plots, and the author gives a good explanation of their use.
Finally, the reader is taken through instructions on overcoming resource constraints. Those dealing with extra-large datasets will particularly appreciate this, but unfortunately very little of it is directly applicable to the free downloadable version. You’ll have to upgrade to a paid version in order to analyze Big Data without resorting to some form of sampling. Once you do, Chisholm gives a fairly exhaustive explanation of how to keep RapidMiner from choking on your data.
Bottom Line:
This is definitely not a stand-alone work on RapidMiner, nor does it aspire to be. What it does, it does very well. Combine this book with another that focuses on Operators and you’ll have a complete look at how to use RapidMiner to best effect.