Basic Machine Learning

Description

This project explores some basic concepts in machine learning.

# Load Data
import pandas as pd

# Path of the file to read
data_file_path = '../input/train.csv'
my_data = pd.read_csv(data_file_path)

# Print summary statistics in next line
my_data.describe()

# List all columns in the dataset
my_data.columns

# dropna drops missing values (think of na as "not available")
my_data = my_data.dropna(axis=0)

# Select a prediction target
y = my_data.column_x

# Select features
my_data_features = ['column_m', 'column_a', 'column_z', 'column_c', 'column_e']
X = my_data[my_data_features]
print(X.describe())
print(X.head()) # Inspect the data at this stage to spot oddities

# Next, use the scikit-learn library to create your model.
# There are four stages to building and using a model: Define, Fit, Predict, and Evaluate.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Split data into training and validation subsets so the model is scored on
# rows it did not see during fitting (this avoids optimistic "in-sample" scores)
# Setting random_state gives the same split on every run
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define the model
my_model = DecisionTreeRegressor(random_state=1)

# Fit the model
my_model.fit(train_X, train_y)

# Make predictions
print("Making initial predictions:")
print(val_X.head())
print("The predictions are")
print(my_model.predict(val_X.head()))

# Compare predictions
print("Check actual values")
print(val_y.head())
print("Compare those to the predictions")
print(my_model.predict(val_X.head()))

# Calculate the mean absolute error
from sklearn.metrics import mean_absolute_error

val_model_predictions = my_model.predict(val_X)

# Print the first few validation predictions and the actual target values from the validation data
print(val_model_predictions[:5])
print(val_y.head())

print(mean_absolute_error(val_y, val_model_predictions))
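
To make the metric above less of a black box, here is a minimal plain-Python sketch of what a mean absolute error computes: the average of the absolute differences between actual and predicted values. The function name mean_absolute_error_sketch and the sample numbers are hypothetical and chosen only for illustration; in practice sklearn.metrics.mean_absolute_error should be used.

```python
def mean_absolute_error_sketch(actual, predicted):
    """Average of |actual - predicted| over paired values (illustrative only)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("need at least one value")
    total_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total_error / len(actual)


# Hypothetical values, picked so the arithmetic is easy to follow:
# the absolute errors are 10, 10 and 30, so the mean is 50 / 3.
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 230.0]
mae_example = mean_absolute_error_sketch(actual, predicted)
print(mae_example)
```

Because the errors are averaged in the units of the target, the result can be read directly as "on average, predictions are off by this much".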

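The depth of a decision tree, meaning the number of splits between the root and a leaf, is a key parameter that controls how complex the model is allowed to be. A balanced tree of depth x has up to 2^x leaves, so the depth rarely needs to exceed the value of x for which 2^x equals the number of training data points. A tree that deep will almost certainly overfit, because each leaf could represent a single training data point. The lowest Mean Absolute Error on the validation data is therefore likely to occur at a shallower depth. The get_mae helper in this section limits complexity with max_leaf_nodes, which caps the number of leaves directly instead of the depth.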
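
The relationship between depth and leaf count can be checked with a few lines of arithmetic. The sketch below assumes a perfectly balanced binary tree; real decision trees are often unbalanced, so this is a rough guide and not a hard limit. The function names and the row count of 1000 are hypothetical.

```python
def max_leaves(depth):
    """Largest number of leaves a binary tree of the given depth can have."""
    return 2 ** depth


def depth_for_one_leaf_per_row(n_rows):
    """Smallest depth at which a balanced binary tree has at least n_rows leaves."""
    if n_rows < 1:
        raise ValueError("n_rows must be at least 1")
    # (n_rows - 1).bit_length() equals ceil(log2(n_rows)) using exact integer math
    return (n_rows - 1).bit_length()


n_rows = 1000  # hypothetical training-set size
depth_needed = depth_for_one_leaf_per_row(n_rows)
print(depth_needed, max_leaves(depth_needed))
```

With 1000 rows, a depth of 10 already allows 1024 leaves, which is enough for every training row to sit in its own leaf.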

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# A loop can help find the max_leaf_nodes value with the lowest Mean Absolute Error
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" % (max_leaf_nodes, my_mae))
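
The loop above prints each score, but the best setting still has to be picked by eye. One possible way to select it in code is sketched below. The helper pick_best and the stand-in scoring function toy_mae are hypothetical; toy_mae exists only so the sketch runs without a dataset and does not represent real results.

```python
def pick_best(candidates, score_fn):
    """Score every candidate and return the one with the lowest score."""
    scores = {candidate: score_fn(candidate) for candidate in candidates}
    best = min(scores, key=scores.get)
    return best, scores


def toy_mae(max_leaf_nodes):
    # Hypothetical stand-in for get_mae: lowest near 100 leaves, worse on either side
    return abs(max_leaf_nodes - 100) + 10.0


best_size, all_scores = pick_best([5, 50, 500, 5000], toy_mae)
print(best_size, all_scores[best_size])
```

With the real data, the scoring function could be lambda n: get_mae(n, train_X, val_X, train_y, val_y). A common follow-up, once a size is chosen, is to refit a DecisionTreeRegressor with that max_leaf_nodes value on all available rows, since the validation split is no longer needed for tuning.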