Beyond Knowledge Innovation

Implementation

Process of Fitting the models in machine learning

January 16, 2024 | CEO | 232 views

The typical steps for using a machine learning model are:

  • Import the libraries you need for your project
  • Load your dataset
  • Split the dataset into training and test sets. The model is trained on the training set and scored on the test set, which it has not seen during training, so the score is a realistic estimate of performance on new data
  • Normalize the training data, then apply the same fitted transformation to the test set
  • Fit the model
  • Predict
  • Evaluate the model

In the “fit” and “predict” steps, you can try several models and evaluate each one, then keep the best-performing one.
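The steps above can be sketched end to end with scikit-learn. This is a minimal illustration, not a recipe: the data is synthetic (an assumption made purely for the example), and the two candidate models, the Lasso `alpha`, the 80/20 split, and the choice of R² as the selection metric are all arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): a linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=5.0, size=200)

# Split: hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Normalize: learn the scaling on the training set only,
# then apply that same transformation to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit, predict, and evaluate each candidate model
candidates = {"linear": LinearRegression(), "lasso": Lasso(alpha=0.1)}
scores = {}
for name, model in candidates.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    scores[name] = {
        "r2": r2_score(y_test, y_pred),
        "mae": mean_absolute_error(y_test, y_pred),
    }

# Keep the candidate with the highest R^2 on the held-out data
best_name = max(scores, key=lambda name: scores[name]["r2"])
print(scores)
print("Best model by test R^2:", best_name)
```

Note that choosing between models on the test set makes the reported test score slightly optimistic. A separate validation split or cross-validation is commonly used for the comparison, with the test set kept for the final check.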

Python libraries:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize, Normalizer
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import r2_score, explained_variance_score, mean_squared_error, mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error, median_absolute_error
import matplotlib.pyplot as plt
import seaborn as sns

Here, we train a model to guess a comfortable boot size for a dog, based on the size of the harness that fits them:

import pandas

# Let's make a dictionary of data for boot sizes
# and harness sizes in cm
data = {
    'boot_size' : [ 39, 38, 37, 39, 38, 35, 37, 36, 35, 40, 
                    40, 36, 38, 39, 42, 42, 36, 36, 35, 41, 
                    42, 38, 37, 35, 40, 36, 35, 39, 41, 37, 
                    35, 41, 39, 41, 42, 42, 36, 37, 37, 39,
                    42, 35, 36, 41, 41, 41, 39, 39, 35, 39
 ],
    'harness_size': [ 58, 58, 52, 58, 57, 52, 55, 53, 49, 54,
                59, 56, 53, 58, 57, 58, 56, 51, 50, 59,
                59, 59, 55, 50, 55, 52, 53, 54, 61, 56,
                55, 60, 57, 56, 61, 58, 53, 57, 57, 55,
                60, 51, 52, 56, 55, 57, 58, 57, 51, 59
                ]
}

# Convert it into a table using pandas
dataset = pandas.DataFrame(data)

# Print the data
dataset.head()

   boot_size  harness_size
0         39            58
1         38            58
2         37            52
3         39            58
4         38            57

Let’s take a very simple model called OLS (ordinary least squares) regression. With a single input, it is just a straight line (sometimes called a trendline).

# Load a library to do the hard work for us
import statsmodels.formula.api as smf

# First, we define our formula using a special syntax
# This says that boot_size is explained by harness_size
formula = "boot_size ~ harness_size"

# Create the model, but don't train it yet
model = smf.ols(formula = formula, data = dataset)

OLS models have two parameters (a slope and an offset), but these haven’t been set in our model yet. We need to train (fit) our model to find these values so that the model can reliably estimate dogs’ boot size based on their harness size.

The following code fits our model to data you’ve now seen:

# Train (fit) the model so that it creates a line that 
# fits our data. This method does the hard work for us. 
fitted_model = model.fit()

# Print information about our model now it has been fit.
# Parameters are looked up by name, because positional indexing
# such as params[1] is deprecated in recent pandas versions.
print("The following model parameters have been found:\n" +
        f"Line slope: {fitted_model.params['harness_size']}\n"+
        f"Line Intercept: {fitted_model.params['Intercept']}")

The following model parameters have been found:
Line slope: 0.585925416738271
Line Intercept: 5.71910981268259
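For a single input, the least-squares line has a closed-form solution: the slope is the sum of the products of the x and y deviations from their means, divided by the sum of the squared x deviations, and the intercept is chosen so that the line passes through the point of means. The sketch below recomputes both from the same data using only the standard library. It is meant as an illustration of what fitting does, not as a description of how statsmodels computes the result internally.

```python
from statistics import mean

# Same data as in the post: boot sizes and harness sizes in cm
boot_size = [39, 38, 37, 39, 38, 35, 37, 36, 35, 40,
             40, 36, 38, 39, 42, 42, 36, 36, 35, 41,
             42, 38, 37, 35, 40, 36, 35, 39, 41, 37,
             35, 41, 39, 41, 42, 42, 36, 37, 37, 39,
             42, 35, 36, 41, 41, 41, 39, 39, 35, 39]
harness_size = [58, 58, 52, 58, 57, 52, 55, 53, 49, 54,
                59, 56, 53, 58, 57, 58, 56, 51, 50, 59,
                59, 59, 55, 50, 55, 52, 53, 54, 61, 56,
                55, 60, 57, 56, 61, 58, 53, 57, 57, 55,
                60, 51, 52, 56, 55, 57, 58, 57, 51, 59]

x_mean = mean(harness_size)
y_mean = mean(boot_size)

# Sum of products of deviations (x with y) and sum of squared x deviations
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(harness_size, boot_size))
s_xx = sum((x - x_mean) ** 2 for x in harness_size)

# Closed-form least-squares estimates for a straight line
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

print(f"Line slope: {slope:.6f}")
print(f"Line Intercept: {intercept:.6f}")
```

The two printed values should agree with the fitted parameters reported above, up to floating-point rounding.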

Notice how training the model set its parameters. We could interpret these directly, but it’s simpler to see it as a graph:

import matplotlib.pyplot as plt

# Show a scatter plot of the data points and add the fitted line
# (parameters are looked up by name rather than by position)
slope = fitted_model.params["harness_size"]
intercept = fitted_model.params["Intercept"]
plt.scatter(dataset["harness_size"], dataset["boot_size"])
plt.plot(dataset["harness_size"], slope * dataset["harness_size"] + intercept, 'r', label='Fitted line')

# add labels and legend
plt.xlabel("harness_size")
plt.ylabel("boot_size")
plt.legend()

The graph above shows our original data as circles, with the fitted model drawn as a red line through them.

Now that we’ve finished training, we can use our model to predict a dog’s boot size from their harness size.

# harness_size states the size of the harness we are interested in
harness_size = { 'harness_size' : [52.5] }

# Use the model to predict what size of boots the dog will fit
approximate_boot_size = fitted_model.predict(harness_size)

# Print the result
print("Estimated approximate_boot_size:")
print(approximate_boot_size[0])

Estimated approximate_boot_size:
36.48019419144182
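A prediction from this model is simply the fitted line evaluated at the new harness size. As a quick check, the sketch below applies the slope and intercept reported earlier by hand. The helper name `predict_boot_size` is made up for this illustration and is not part of statsmodels.

```python
# Parameter values copied from the fitted model output shown earlier
intercept = 5.71910981268259
slope = 0.585925416738271


def predict_boot_size(harness_size_cm):
    """Evaluate the fitted line: boot_size = intercept + slope * harness_size."""
    return intercept + slope * harness_size_cm


# Same harness size as in the prediction above
print(predict_boot_size(52.5))

# The slope is the change in predicted boot size per extra cm of harness
print(predict_boot_size(53.5) - predict_boot_size(52.5))
```

The first value should match the model’s own prediction up to floating-point rounding. The second shows how to read the slope: each additional centimetre of harness size raises the predicted boot size by roughly 0.59.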