Introduction to RapidMiner Part 3

Now that you know how to import data and examine it, it’s time to get to the meat of RapidMiner: building a Process. And when you’re building a Process, the Design screen becomes really important.
Open RapidMiner, and if it doesn’t go straight to the Design page, click on the Design tab at the top right.
We’re going to work with a different dataset than before, so download it by right-clicking here. Follow the steps from our previous tutorial to import it into your data repository. Once it’s there, click and drag the new dataset into the main work area and we’re in business.

In order to get through this tutorial we’ll first need to talk about Operators. The idea behind Operators is that they are little “black boxes” that perform a given function on the data you send through them. What makes RapidMiner so powerful is that you can combine these Operators (and the functions they perform) in so many different permutations. You can run them one after another in series, line them up to run in parallel, or any combination of the two.
Operators come in several varieties. There are ones that change your data based on criteria you specify (for example, if you need to clean data prior to analysis). There are Operators to output your data in various formats. And of course there are Operators that relate to statistical models: Regressions, Classifiers, etc.
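If it helps to think in code, the “black box” idea can be sketched in a few lines of Python. This is only an analogy and not how RapidMiner is implemented: the function names (remove_missing, sort_by, run_in_series) and the three-row data set are made up for illustration. Each Operator is a function that takes rows in and hands rows out, and a Process simply feeds the output of one into the next.

```python
from functools import reduce

def remove_missing(name):
    """Return an operator that drops rows where column `name` is missing."""
    def operator(rows):
        return [row for row in rows if row.get(name) is not None]
    return operator

def sort_by(name):
    """Return an operator that sorts rows by column `name`."""
    def operator(rows):
        return sorted(rows, key=lambda row: row[name])
    return operator

def run_in_series(rows, operators):
    """Feed the output of each operator into the next one, left to right."""
    return reduce(lambda data, op: op(data), operators, rows)

# Hypothetical data, invented for this sketch.
customers = [
    {"id": "ID001", "age": 48},
    {"id": "ID002", "age": None},
    {"id": "ID003", "age": 40},
]

result = run_in_series(customers, [remove_missing("age"), sort_by("age")])
print(result)
```

Because every operator has the same shape (rows in, rows out), you can reorder or swap them freely, which is the same property that lets you rewire Operators on the Design screen.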
Take a moment to browse through the folders in the “Operators” pane at the top left, and you’ll get a good idea of the various functions RapidMiner can perform on your data. When you realize that you can stack these in almost any logical combination, it should sink in how powerful a tool this really is. Go ahead and browse; I’ll wait.
Done? Good. What we’re going to do today is take the customer data from our new dataset and use it to “segment” our customer base, that is, to separate customers into clusters based on things they have in common. Why would we want to do that? More efficient marketing comes to mind. Rather than send a mailer to everyone, why not just send it to the group of people who are most likely to use it?
In order to segment our customers, we’ll need an Operator that performs some means of clustering. K-means is good for this, so we’ll go into the “Modeling –> Clustering and Segmentation” folder and drag the K-means Operator into the main design screen, like so:

Now pull a line from the output of the “Retrieve” Operator to the input of the “Clustering” Operator. Do the same between the top output of “Clustering” and the “res” nub at the top right of the design window.
In the “Parameters” window tell the Operator we want six clusters (the “k” field). We’ll also tell it to add a “Cluster” attribute in the data to identify which cluster RapidMiner decided each row belongs to.
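For the curious, here is roughly what a K-means Operator does with those two settings, written as a minimal Python sketch. It is a textbook version of the algorithm and not RapidMiner’s actual implementation. The function name k_means, the seed argument, the “cluster_N” label format, and the toy data (age, income in thousands) are all assumptions made for illustration, and the toy data uses k = 2 rather than six to keep it short.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Plain k-means: returns one cluster index per point, plus the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # nothing moved, so we have converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical customers as (age, income in thousands).
points = [(25, 20), (27, 21), (30, 22), (60, 50), (62, 52), (65, 51)]
labels, centroids = k_means(points, k=2)

# Mimic the added "Cluster" attribute: one label per row.
rows = [
    {"age": age, "income": income, "cluster": f"cluster_{label}"}
    for (age, income), label in zip(points, labels)
]
for row in rows:
    print(row)
```

Notice that the distance calculation only works on numbers. Keep that in mind for what happens next.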

Click the blue “play” triangle at the top of the page to run the process, and... oops. We’ve got a problem:

K-means expects numerical data to work with, and our ID field is not numerical. We don’t really need the ID field for what we’re doing, so we can just get rid of it. Pull the “Select Attributes” operator into the design area and drop it between our data and the Clustering Operator. The connections should automatically adjust. In the “Parameters” window at the top right, select “Single” in “attribute filter type” since we only want to select one attribute, then choose “id” to select it. Since we want to exclude ID rather than include it, check the “invert selection” box. Now the “ID” field will be missing from the data we pass on to the K-means Operator. Problem solved. Push the Play button to run the process again.
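In code terms, “select one attribute, then invert the selection” boils down to the following. This is a minimal Python sketch of the idea, not RapidMiner’s API: the function select_attributes and its invert_selection argument are named after the GUI settings purely for illustration, and the two sample rows are made up.

```python
def select_attributes(rows, names, invert_selection=False):
    """Keep only the named columns, or everything except them when inverted."""
    selected = set(names)
    return [
        {key: value for key, value in row.items()
         if (key in selected) != invert_selection}
        for row in rows
    ]

# Hypothetical rows, invented for this sketch.
customers = [
    {"id": "ID001", "age": 48, "sex": "FEMALE", "married": "NO"},
    {"id": "ID002", "age": 40, "sex": "MALE", "married": "YES"},
]

without_id = select_attributes(customers, ["id"], invert_selection=True)
print(without_id)
```

Without the invert flag, the same call would keep only the id column, which is the opposite of what we want here.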

OK, still not quite there. Turns out K-means doesn’t like terms such as “Male”, “Female”, “Yes” and “No” either. We need those, so we can’t just exclude them. How do we fix this?
We’ll need to transform that data from text to numbers (for example, 1 for Yes, 0 for No). To do that we’ll use the “Nominal to Numerical” Operator. Drag and drop it between the “Select Attributes” and “Clustering” Operators and again the connections should automatically adjust. Your Process should now look like this:
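Here is one way that text-to-number conversion can work, as a minimal Python sketch. It uses “dummy coding”, where every distinct text value gets its own 0/1 column. Treat it as an assumption about the general technique rather than a description of what the Operator does internally; the function name, the “column = value” naming, and the sample rows are invented for illustration.

```python
def nominal_to_numerical(rows):
    """Replace every text-valued column with one 0/1 column per distinct value."""
    # First pass: collect the distinct text values of each nominal column.
    categories = {}
    for row in rows:
        for name, value in row.items():
            if isinstance(value, str):
                categories.setdefault(name, set()).add(value)

    # Second pass: expand nominal columns, pass numeric columns through.
    encoded = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if name in categories:
                for category in sorted(categories[name]):
                    new_row[f"{name} = {category}"] = 1 if value == category else 0
            else:
                new_row[name] = value
        encoded.append(new_row)
    return encoded

# Hypothetical rows (id already removed), invented for this sketch.
customers = [
    {"age": 48, "sex": "FEMALE", "married": "NO"},
    {"age": 40, "sex": "MALE", "married": "YES"},
]

encoded = nominal_to_numerical(customers)
print(encoded[0])
```

Every value in the result is a number, so a distance-based method like K-means has something it can actually measure.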

Hit the blue play button to run the process and this time it should work. You’ll be taken to the results screen where you can see what K-means has done with our data:

We can see that RapidMiner formed six clusters just as we asked, but how did it group our customers? What criteria did it use? In order to discover that we’ll need to click on the “Centroid Table” tab to the left.

Now we can see more detail about our six clusters. For example, Cluster 3 is composed mostly of married, inner-city-dwelling males who all have savings accounts. Age seems to play a major role in all of our clusters, as the average age differs dramatically across them.
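A centroid table is just the average of every attribute within each cluster. The minimal Python sketch below shows the arithmetic; the function name centroid_table and the four sample rows are made up for illustration and are not taken from our dataset. For a 0/1 column such as “married”, the average reads as the share of customers in that cluster who are married.

```python
from collections import defaultdict

def centroid_table(rows, cluster_key="cluster"):
    """Average every numeric attribute within each cluster."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_key]].append(row)

    table = {}
    for cluster, members in groups.items():
        attributes = [name for name in members[0] if name != cluster_key]
        table[cluster] = {
            name: sum(member[name] for member in members) / len(members)
            for name in attributes
        }
    return table

# Hypothetical clustered rows, invented for this sketch.
rows = [
    {"age": 30, "married": 1, "cluster": "cluster_0"},
    {"age": 40, "married": 0, "cluster": "cluster_0"},
    {"age": 60, "married": 1, "cluster": "cluster_1"},
    {"age": 70, "married": 1, "cluster": "cluster_1"},
]

table = centroid_table(rows)
for cluster, centroid in sorted(table.items()):
    print(cluster, centroid)
```

In this toy data, cluster_1 is older on average and entirely married, which is the same kind of reading we just did on the real centroid table.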
Congratulations, you’ve just segmented a customer base. See what else you can do by experimenting with various Operators. Have fun!