Introduction to RapidMiner Part 2

Now that we have RapidMiner downloaded and installed, it’s time to import a dataset so we can begin to examine it. Hopefully the quick overview of the GUI in our last post has given you enough to be able to navigate with some degree of effectiveness, but I’ll go over it a little more this time around.
First you’ll want to download the sample sales dataset to your desktop by clicking here. Then open RapidMiner and select “Import CSV” in the repository section.

Select the sales data CSV you just downloaded and click “Next”.

Here is where RapidMiner gets a little brain-dead. Despite the fact that you are importing a CSV (COMMA-separated values) file, RapidMiner assumes the separator is a semicolon. Left uncorrected, every row is read as a single column, which makes the imported data useless, so click on the “Comma” radio button to change the separator, then click “Next”.
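If you’d like to see why the separator matters so much, here is a minimal Python sketch of the same problem. It is not what RapidMiner does internally, just an illustration using Python’s standard csv module. The two data rows and the “City” column are made up for the example and are not taken from the real sales file.

```python
import csv
import io

# Made-up stand-in for the sales file: a header plus two invented rows.
SAMPLE = "Product,Price,City\nProduct1,1200,CityA\nProduct2,3600,CityB\n"


def parse(text, delimiter):
    """Parse CSV text with the given separator and return a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Reading comma-separated text with a semicolon separator finds no
# separators at all, so each line comes back as one long field.
wrong = parse(SAMPLE, ";")

# Reading it with a comma separator splits each line into its columns.
right = parse(SAMPLE, ",")

print("semicolon:", len(wrong[0]), "column(s)")
print("comma:    ", len(right[0]), "column(s)")
```

With the semicolon, the whole header line ends up in one field, which is exactly the kind of mess the “Comma” radio button saves you from.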

The next screen will ask you to name your attributes, but for the purposes of this tutorial we’ll skip that. Click “Next”.
On the next screen we are asked about the types of our attributes. For some reason RapidMiner thinks that all of our data falls into the “polynominal” type (RapidMiner’s spelling for a categorical type with more than two values), which is incorrect for most of our columns. We’ll need to change those manually: for every column that holds text choose “text”, for every column that holds a date and time choose “date_time”, and for every column that holds numeric values choose “integer.” We won’t be using the Latitude and Longitude columns, so just deselect those at the top. Once again, click “Next.”
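For readers who think better in code, the type choices above amount to something like the following sketch. It is only an analogy for what the wizard does, written with Python’s standard library. The rows are invented, and the “Transaction_date” column name and its date format are assumptions for illustration; only Product, Price, Latitude, and Longitude are names we’ve actually mentioned.

```python
import csv
import io
from datetime import datetime

# Invented rows; "Transaction_date" and its format are assumptions.
SAMPLE = (
    "Transaction_date,Product,Price,Latitude,Longitude\n"
    "1/2/09 6:17,Product1,1200,51.5,-1.1\n"
    "1/2/09 4:53,Product2,3600,39.2,-76.6\n"
)

# Columns we deselect, as in the import wizard.
DROP = {"Latitude", "Longitude"}

# One converter per column; anything not listed stays as text.
CONVERTERS = {
    "Transaction_date": lambda s: datetime.strptime(s, "%m/%d/%y %H:%M"),
    "Price": int,
}


def typed_rows(text):
    """Read CSV text, drop unwanted columns, and convert the rest."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, value in raw.items():
            if name in DROP:
                continue
            row[name] = CONVERTERS.get(name, str)(value)
        rows.append(row)
    return rows


rows = typed_rows(SAMPLE)
for row in rows:
    print(row)
```

The point is the same as in the wizard: until each column has the right type, dates are just strings and prices can’t be added up.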

Step 5 asks you to select a Repository to save your data to and give the dataset a name. Once you’ve done that click “Finish” and you should be taken to the Results page of RapidMiner, which looks like this:

Before we do any analysis of data using Processes, it’s always a good idea to have a look to see what we’re working with. The Results page has some good tools for doing just that, so let’s use them. Click on “Statistics” in the left-hand side of the page and you should see this:

As you can see, the Statistics pane gives us a glimpse into what kind of data we have. We can see the dates our first and last orders were entered, which cities have the most orders and which have the fewest, and other summary statistics. Seeing this information in text form is good, but to see it in a visualization we’ll need to click on “Charts”. We’ll look at how product sales are going, so select “Product” under the “Group-by” and “Legend” columns and “Price” under the “Value” column. Here’s the result:
Obviously Product1 is a big seller for this company.
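Under the hood, a chart like this boils down to grouping rows by Product and aggregating Price. Here is a rough Python sketch of that idea. It assumes the aggregation is a simple sum, which may not match the chart’s actual setting, and the orders listed are invented rather than taken from the sales file.

```python
from collections import defaultdict

# Invented (product, price) pairs standing in for the imported orders.
ORDERS = [
    ("Product1", 1200),
    ("Product2", 3600),
    ("Product1", 1200),
    ("Product1", 1200),
    ("Product1", 1200),
]


def total_by_product(pairs):
    """Group by product name and sum the prices in each group."""
    totals = defaultdict(int)
    for product, price in pairs:
        totals[product] += price
    return dict(totals)


totals = total_by_product(ORDERS)

# Print the groups from largest to smallest total.
for product, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(product, total)
```

Swapping the sum for a count or an average would answer slightly different questions (how many orders versus how much money), so it’s worth knowing which one your chart is showing.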
Now you know how to create a data repository, import data, and view summary statistics about that data. Next time we’ll get into the real meat of RapidMiner: creating a process. See you soon.