Random Forests are widely adopted in Data Science for good reason: they are one of the most powerful classification methods available, and in Python they are particularly easy to implement.
Any discussion of Random Forests must begin with Decision Trees, because a Random Forest is simply a collection of Decision Trees, each trained on a slightly different random sample of the data (hence the name).
Decision Trees are remarkably simple to understand; you use a mental version of them every day without realizing it. Let’s say you want to decide what to do this weekend; the process might look something like this:
The first decision criterion might be the weather: is it raining or not? If it is, the next decision point is probably the temperature. If it’s raining and it’s hot, you decide to stay home. If it’s raining but the temperature is mild, you might venture outside. But where? That brings us to the next criterion: the status of your refrigerator. If you don’t have groceries, a trip to the store is in order; but if you do, some time with a good book at your favorite coffee shop might be more appropriate.
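The decision process above can be sketched as nested if/else statements, which is exactly the structure a Decision Tree encodes. This is an illustrative sketch, not real code from the article; the not-raining branch is filled in for completeness since the text only walks the raining side:

```python
def weekend_plan(raining, temperature, have_groceries):
    """Decide a weekend activity by walking the decision tree described above."""
    if raining:
        if temperature == "hot":
            return "stay home"
        else:  # mild temperature: venture outside, but where?
            if have_groceries:
                return "read at the coffee shop"
            else:
                return "go to the grocery store"
    else:
        # Branch not covered in the text; assumed for completeness
        return "go outside"

print(weekend_plan(raining=True, temperature="mild", have_groceries=False))
# "go to the grocery store"
```

A trained Decision Tree learns exactly this kind of branching structure from data instead of having it hand-written.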
Decision Trees are simple to understand, but unfortunately they can easily overfit the data. Random Forests were invented to counter this: each tree in the forest is trained on a different random sample of the data, with a random subset of features considered at each split. Not only does this reduce the risk of overfitting, it also improves the reliability of the results, making Random Forests a very popular choice for classification.
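A quick way to see this effect is to compare a single, fully grown Decision Tree with a Random Forest on the same data. This sketch uses a synthetic dataset from scikit-learn rather than real data, purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, just for the comparison
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# An unpruned tree memorizes the training set perfectly (train accuracy 1.0),
# while the forest typically generalizes better to the unseen test set.
print(f"tree   train/test: {tree.score(X_train, y_train):.2f} / {tree.score(X_test, y_test):.2f}")
print(f"forest train/test: {forest.score(X_train, y_train):.2f} / {forest.score(X_test, y_test):.2f}")
```

The gap between train and test accuracy for the single tree is the overfitting the paragraph above describes; averaging many randomized trees narrows it.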
Coding Random Forests in Python
# First, import our libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Load the CSV with Pandas
df = pd.read_csv('data.csv')
# Define the features and the target (X and y)
X = df.drop('columnname', axis=1)  # drop the target column
y = df['columnname']  # keep only the target column, as a Series
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Define the algorithm and train the model
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
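From here, a typical next step is to score the trained forest on the held-out test set. Since 'data.csv' and 'columnname' above are placeholders, this sketch substitutes a synthetic dataset so the full train-and-evaluate flow runs end to end:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV data (in the article, X and y come from data.csv)
X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)  # train on the training split

predictions = forest.predict(X_test)  # predict classes for unseen rows
print(f"Test accuracy: {accuracy_score(y_test, predictions):.3f}")
```

With a real dataset, `forest.predict` takes any DataFrame with the same feature columns as X, so the same two lines serve for scoring new, unlabeled data as well.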