Mike Smith

Basic Machine Learning

Description

This project explores some basic concepts in machine learning.

# Load Data
import pandas as pd

# Path of the file to read
data_file_path = '../input/train.csv'
my_data = pd.read_csv(data_file_path)

# Print summary statistics
print(my_data.describe())

# List all columns in the dataset
print(my_data.columns)

# dropna(axis=0) drops every row that contains a missing value (think of na as "not available")
my_data = my_data.dropna(axis=0)
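
As a small illustration of what dropna(axis=0) removes, the sketch below uses a made-up frame. The values and the "unused" column are assumptions for illustration only; column_x and column_m simply reuse names from this project. It compares dropping rows with a gap in any column against dropping only rows with a gap in the columns the model needs.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in different columns (made-up values)
toy = pd.DataFrame({
    "column_x": [1.0, 2.0, np.nan, 4.0],
    "column_m": [10.0, np.nan, 30.0, 40.0],
    "unused": [np.nan, 1.0, 1.0, 1.0],  # hypothetical column the model never reads
})

# Drop every row that has a missing value in ANY column
all_dropped = toy.dropna(axis=0)

# Drop only rows that are missing a value in the columns that will be used
subset_dropped = toy.dropna(axis=0, subset=["column_x", "column_m"])

print("Rows before:", len(toy))
print("Rows after dropna(axis=0):", len(all_dropped))
print("Rows after dropna with subset:", len(subset_dropped))
```

When a dataset has many columns that are never used as features, the subset form may keep noticeably more rows.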

# Select a prediction target
y = my_data.column_x

# Select features
my_data_features = ['column_m', 'column_a', 'column_z', 'column_c', 'column_e']
X = my_data[my_data_features]
print(X.describe())
print(X.head()) # Inspect the data at this stage to spot oddities

# Next use the scikit-learn library to create your model.
# There are four stages to building and using a model: Define, Fit, Predict, and Evaluate.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Split data into training and validation subsets so the model is scored on rows
# it was not fitted to, which avoids misleading "in-sample" scores.
# A fixed random_state gives the same split on every run.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define the model
my_model = DecisionTreeRegressor(random_state=1)

# Fit the model
my_model.fit(train_X, train_y)

# Make predictions
print("Making initial predictions:")
print(val_X.head())
print("The predictions are")
print(my_model.predict(val_X.head()))

# Compare predictions
print("Check actual values")
print(val_y.head())
print("Compare those to the predictions")
print(my_model.predict(val_X.head()))

# Calculate the mean absolute error
from sklearn.metrics import mean_absolute_error

val_model_predictions = my_model.predict(val_X)

# Print the first few validation predictions and the actual target values from the validation data
print(val_model_predictions[:5])
print(val_y.head())

print(mean_absolute_error(val_y, val_model_predictions))
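
To illustrate why the model is scored on validation data, here is a minimal sketch on synthetic data. The generated data and the demo_ names are assumptions for illustration, not part of this project's dataset. It fits a fully grown tree and computes the Mean Absolute Error both on the rows it was trained on and on held-out rows.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the project's train.csv)
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(500, 3))
demo_y = 2.0 * demo_X[:, 0] + np.sin(demo_X[:, 1]) + rng.normal(0, 1.0, size=500)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=1
)

demo_model = DecisionTreeRegressor(random_state=1)
demo_model.fit(demo_train_X, demo_train_y)

# In-sample error: scored on the same rows the tree was fitted to
in_sample_mae = mean_absolute_error(demo_train_y, demo_model.predict(demo_train_X))

# Validation error: scored on rows the tree has never seen
validation_mae = mean_absolute_error(demo_val_y, demo_model.predict(demo_val_X))

print("In-sample MAE:  %.3f" % in_sample_mae)
print("Validation MAE: %.3f" % validation_mae)
```

The in-sample figure looks far better than the validation figure because an unrestricted tree can memorize its training rows, which is why only the validation score is a fair estimate of performance on new data.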

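The depth of a decision tree, that is, the number of splits between the root and a leaf, is a key parameter that controls how complex the model is allowed to be. A balanced tree of depth x has up to 2^x leaves, so a depth of roughly x, where 2^x equals the number of training data points, is already enough to give every training point its own leaf. A tree grown that far will almost certainly overfit, because each leaf can memorize a single training data point, so the lowest validation Mean Absolute Error is likely to occur at a smaller depth. The get_mae function below limits complexity through max_leaf_nodes, which caps the number of leaves rather than the depth.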
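
The relationship between depth, leaf count, and overfitting can be sketched on synthetic data. The generated data and the toy_ names are assumptions for illustration only. The sketch grows one unrestricted tree to see how many leaves it ends up with, then compares training and validation error for a few max_depth settings.

```python
import math

import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the project's train.csv)
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(400, 2))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(0, 2.0, size=400)
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=1
)

n_train = len(toy_train_X)
print("Training rows: %d, log2(rows): %.1f" % (n_train, math.log2(n_train)))

# With no limits, the tree keeps splitting until its leaves are pure
full_tree = DecisionTreeRegressor(random_state=1).fit(toy_train_X, toy_train_y)
print("Unrestricted tree: depth %d, leaves %d"
      % (full_tree.get_depth(), full_tree.get_n_leaves()))

# Compare training and validation error as the depth limit is relaxed
results = {}
for depth in [2, 4, 8, None]:  # None means no depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1)
    tree.fit(toy_train_X, toy_train_y)
    train_mae = mean_absolute_error(toy_train_y, tree.predict(toy_train_X))
    val_mae = mean_absolute_error(toy_val_y, tree.predict(toy_val_X))
    results[depth] = (train_mae, val_mae)
    print("max_depth=%s  train MAE: %.2f  validation MAE: %.2f"
          % (depth, train_mae, val_mae))
```

Real trees are rarely perfectly balanced, so the unrestricted tree may grow deeper than log2 of the row count even though that depth would suffice for a balanced tree.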

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# A loop can help find a minimum Mean Absolute Error
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %.2f" % (max_leaf_nodes, my_mae))
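
One possible next step is to pick the best candidate programmatically and refit on all available rows. The following is a minimal sketch under stated assumptions: the data is synthetic, and the names candidate_sizes, scores, best_size, and final_model are hypothetical. get_mae is repeated here only so the sketch runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Synthetic stand-in data (an assumption, not the project's train.csv)
rng = np.random.default_rng(0)
sample_X = rng.uniform(0, 10, size=(600, 3))
sample_y = sample_X[:, 0] ** 2 + rng.normal(0, 5.0, size=600)
s_train_X, s_val_X, s_train_y, s_val_y = train_test_split(
    sample_X, sample_y, random_state=1
)

# Score every candidate and keep the one with the lowest validation MAE
candidate_sizes = [5, 50, 500, 5000]
scores = {
    size: get_mae(size, s_train_X, s_val_X, s_train_y, s_val_y)
    for size in candidate_sizes
}
best_size = min(scores, key=scores.get)
print("Best max_leaf_nodes: %d (validation MAE %.2f)" % (best_size, scores[best_size]))

# Once the size is chosen, refit on all rows so no data is held back
final_model = DecisionTreeRegressor(max_leaf_nodes=best_size, random_state=0)
final_model.fit(sample_X, sample_y)
```

Refitting on the full dataset is a common choice once tuning is finished, since the validation rows are no longer needed for comparison; the validation MAE of the chosen size remains the estimate of how the final model is expected to perform.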