Beginning Machine Learning For Developers

Anup Singh
6 min read · Jan 31, 2022

[Image: Anaconda Navigator]

Key Elements of Machine Learning

  • Representation
  • Evaluation (accuracy, precision and recall, probability)
  • Optimization

Inductive Learning is where we are given examples of a function in the form of data (x) and the output of the function (f(x)). The goal of inductive learning is to learn the function so that it can predict f(x) for new data (x).

  • Classification: when the function being learned is discrete.
  • Regression: when the function being learned is continuous.
  • Probability Estimation: when the output of the function is a probability.

Python Libraries

  • Matplotlib [ Charts ]
  • Seaborn [ Heat maps, charts ]
  • Scikit-learn [ Data mining, data analysis ]
  • Pandas [ Data wrangling, data manipulation, aggregation and visualization ]
  • NumPy [ Numerical Python: n-dimensional arrays and matrices, mathematical operations ]

Types of ML

  • Supervised ( Training data set )
  • Reinforcement ( Reward, learn from consequences of actions )
  • Unsupervised (Unlabeled data -> Clusters)

Life cycle

Collecting Data -> Data Wrangling -> Analyse Data -> Train Algorithm -> Test Algorithm -> Deployment

Supervised Learning

  • Linear Regression [ y = a + bX]
  • Logistic Regression ( Categorical )
  • Decision Tree ( Classification -> categorical/continuous dependent variable )
  • Random Forest (Ensemble of decision tree, gives better prediction accuracy than Decision Tree)
  • Naive Bayes Classifier (Bayes’ Theorem with an assumption of independence between predictors)

Linear Regression

  • Dependent variable, Independent Variable -> Relations
  • Correlation -> Measures the mutual relation between variables -> Plot using seaborn
  • Regression Line -> Best fits the data by minimizing the mean squared error (MSE)

Example

  • Pricing of real estate
  • Predict the dependent variable for a new value of the independent variable

Model Fitting

  • Underfit -> Too simple; fails to capture the pattern even in the training data
  • Appropriate -> Captures the pattern and generalizes to unseen data
  • Overfit -> Fits the training data well but generalizes poorly (see the sketch below)
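
A minimal sketch of the three regimes, using polynomial degree as the knob for model complexity (the data and degrees here are made up for illustration):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 40)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):  # underfit, appropriate, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # an overfit model scores high on training data but poorly on test data
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))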

Logistic Regression

Analyses a dataset in which one or more independent variables determine an outcome.

  • Outcome -> Discrete

S-Curve (0 or 1)

How does the probability of getting lung cancer (yes vs. no) change for every additional pound a person is overweight and for every pack of cigarettes smoked per day?

Do body weight, calorie intake, fat intake, and age have an influence on the probability of having a heart attack (yes vs. no)?

https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/what-is-logistic-regression/
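
A minimal sketch of the idea (the numbers are made up, not from the study above): logistic regression squeezes a linear function of the input through the S-curve to produce a probability between 0 and 1.

import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical data: packs of cigarettes smoked per day -> lung cancer (1/0)
packs_per_day = np.array([0, 0, 1, 1, 2, 2, 3, 3]).reshape(-1, 1)
cancer = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(packs_per_day, cancer)
print(model.predict_proba([[1.5]]))  # [P(no), P(yes)] at 1.5 packs/day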

Decision Tree

Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain parameter.

  • Leaves => decisions or the final outcomes.
  • Decision nodes are where the data is split.
  1. Classification trees (Yes/No types)
  2. Regression trees (Continuous data types)

Predict whether a person is fit, given information like age, eating habits, and physical activity.

Consider a piece of data collected over the course of 14 days, where the features are Outlook, Temperature, Humidity and Wind, and the outcome variable is whether golf was played on the day. Our job is to build a predictive model that takes in the above four parameters and predicts whether golf will be played. We'll build a decision tree to do that using the ID3 algorithm; a scikit-learn sketch follows.
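
A sketch of this example. Note that scikit-learn builds CART-style binary trees rather than ID3 proper, but criterion="entropy" uses the same information-gain measure; the 14 rows below are a version of the classic play-golf dataset.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

golf = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool", "Mild",
                    "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
             "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Play": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
             "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})
X = pd.get_dummies(golf.drop("Play", axis=1))  # one-hot encode the categories
y = golf["Play"]
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(tree.score(X, y))  # training accuracy on the 14 days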

Logistic Regression and Decision Tree

from sklearn.model_selection import cross_val_score, train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("data.csv", header=0)
print(data.head(6))
data.info()
data.drop("Unnamed: 32", axis=1, inplace=True)  # drop the empty trailing column
data.columns
data.drop("id", axis=1, inplace=True)  # the id column carries no signal
data["diagnosis"] = data["diagnosis"].map({"M": 1, "B": 0})  # encode the label
data.describe()
sns.countplot(x=data["diagnosis"], label="count")
corr = data.corr()
sns.heatmap(corr)  # heat map of pairwise correlations
prediction_var = ["texture_mean", "perimeter_mean", "smoothness_mean",
                  "compactness_mean", "symmetry_mean"]
train, test = train_test_split(data, test_size=0.3)
train_X, train_y = train[prediction_var], train.diagnosis
test_X, test_y = test[prediction_var], test.diagnosis  # the test split was missing
logistic = LogisticRegression()
logistic.fit(train_X, train_y)
temp = logistic.predict(test_X)
print(metrics.accuracy_score(test_y, temp))
clf = DecisionTreeClassifier(random_state=0)
print(cross_val_score(clf, train_X, train_y, cv=10))  # 10-fold cross-validation
clf.fit(train_X, train_y)
clf.get_params(deep=True)
clf.predict(test_X)
print(clf.score(test_X, test_y))

Random Forest

Random Forest is a classifier that builds a number of decision trees on various subsets of the given dataset and combines their votes to improve predictive accuracy.

Ensemble classifier made using many Decision tree models

Combines the results from the different models; a minimal sketch follows.
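
A minimal sketch, reusing the train_X/train_y/test_X/test_y split from the breast-cancer example above:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_X, train_y)
print(forest.score(test_X, test_y))  # typically beats the single decision tree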

How to Choose a Machine Learning Model?

Multiclass or multinomial classification is the problem of classifying instances into one of three or more classes.

Naive Bayes (Bayes’ Theorem)

Classification Technique

P(H|E) = P(E|H) × P(H) / P(E)

News Categorization

Spam Filtering

Weather Predictions

Run-or-walk classification (features: date, time, username, wrist, gyro_x, gyro_y, gyro_z)

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report
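
A sketch completing the run-or-walk example with the imports above; the file name and the "activity" label column (0 = walk, 1 = run) are assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("run_or_walk.csv")  # hypothetical file name
features = ["wrist", "gyro_x", "gyro_y", "gyro_z"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["activity"], test_size=0.3, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
pred = nb.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))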

Scikit Learn

from sklearn.linear_model import LinearRegression

y = df["price"]
x = df[["room_num"]]
ln = LinearRegression()
ln.fit(x, y)
print(ln.intercept_, ln.coef_)
help(ln)
ln.predict(x)
sns.jointplot(x=df["room_num"], y=df["price"], data=df, kind="reg")

  • Logistic Regression
  • train test split
  • metrics
  • Decision tree classifier -> Test accuracy and model
  • Random Forest

Data Preprocessing

Univariate Analysis

  • Mean
  • Median
  • Mode
  • Quartiles
  • Standard Deviation
  • Variance

EDD (Extended Data Dictionary) => check for the presence of outliers -> df.describe()

Is a variable skewed? Does it have outliers?

Plot to see outliers (jointplot, countplot)

Observations from EDD

  • Missing values
  • Skewness or outliers
  • variables with no significance

Outlier Treatment

  • Capping and Flooring
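
A minimal capping-and-flooring sketch: clip a column at, say, its 1st and 99th percentiles so extreme outliers stop dominating the fit (the df and column name are assumed):

import pandas as pd

def cap_and_floor(s: pd.Series, lo=0.01, hi=0.99) -> pd.Series:
    # clip values below the lo-quantile and above the hi-quantile
    return s.clip(lower=s.quantile(lo), upper=s.quantile(hi))

df["price"] = cap_and_floor(df["price"])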

Seasonality in Data

  • Normalize with a monthly multiplier: m = mean(year) / mean(month)
  • Deseasonalized value: d = d(p) × m (see the sketch below)
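
A sketch of this multiplicative de-seasonalizing step; the "date" and "sales" column names are made up, and df["date"] is assumed to be a datetime column:

import pandas as pd

# scale each month's values by (yearly mean / that month's mean)
monthly_mean = df.groupby(df["date"].dt.month)["sales"].transform("mean")
m = df["sales"].mean() / monthly_mean
df["sales_deseasonalized"] = df["sales"] * m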

Bivariate Analysis

  • Scatter plot [ keep, discard and transform variables ]
  • Correlation [ X1, X2 ]

Variable Transformation

  • Mean of multiple variables
  • Ratio variables
  • Log, exponential to make it linear

Dummy variables -> encode categorical values

pd.get_dummies(df)

Correlation vs Causation

Classification Models

  • Logistic Regression

The dependent variable must be categorical in nature.

The independent variables should not have multicollinearity.

  • Linear Discriminant Analysis
  • K Nearest Neighbour (used for regression as well as classification, but mostly for classification problems)
  • SVM (Support Vector Machine) (Classification as well as Regression)

There is a car manufacturer that has built a new SUV. The company wants to show ads to the users who are interested in buying that SUV.

Logistic Regression

Whether the house will be sold within 3 months or not.

from sklearn.linear_model import LogisticRegression

x = df[["price"]]
y = df["Sold"]
clf_lrs = LogisticRegression()
clf_lrs.fit(x, y)
print(clf_lrs.intercept_, clf_lrs.coef_)
clf_lrs.predict(x)

Unsupervised Learning

Distribute the data into clusters consisting of similar data points.

Intrinsic grouping in a set of unlabelled data

Marketing

  • Discovering distinct groups in customer data set
  • Customers who use the internet rather than calls
  • Risky customers (insurance, loan)

Clustering

Distance from centroid

Euclidean distance

How to decide the number of Clusters

  • Elbow method (SSE)

Types of Clustering

  • Exclusive Clustering (Item belong to only one cluster) (K-Means Clustering)
  • Overlapping Clustering (Item can belong to multiple clusters) (C-Means Clustering)
  • Hierarchical Clustering ( Clusters with parent-child relationships )
import matplotlib.pyplot as plt
plt.style.use("ggplot")  # the "matplot -> ggplot" note: use the ggplot style
from sklearn.cluster import KMeans
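
An elbow-method sketch using the imports above: plot the SSE (KMeans exposes it as inertia_) against k and look for the bend where adding clusters stops paying off. X is placeholder data here.

import numpy as np

X = np.random.RandomState(0).rand(200, 2)  # placeholder feature matrix
sse = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # sum of squared distances to the centroids
plt.plot(ks, sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE")
plt.show()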

C-Means Clustering

Soft clustering

Hierarchical Clustering

Once a decision is made to combine two clusters, it can’t be undone

Too slow for large datasets

Market Basket Analysis

Association Rule Mining

  • Technique that shows how items are associated with each other.
  • Laptop => Laptop Bags

Support, Confidence, Lift

  • Support: s(A => B) = freq(A, B) / N (fraction of the N transactions that contain both A and B)
  • Confidence: c(A => B) = freq(A, B) / freq(A) (how often B occurs given that A occurs)
  • Lift: l(A => B) = s(A, B) / (s(A) × s(B)) (lift > 1 means A and B co-occur more often than chance)

Example items: A, B, C, D, E (a worked toy computation follows)
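
A worked computation of the three measures over made-up transactions on those items:

transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "D"},
    {"B", "E"},
    {"A", "B", "D"},
]
N = len(transactions)

def freq(*items):
    # number of transactions containing every given item
    return sum(all(i in t for i in items) for t in transactions)

support = freq("A", "B") / N             # 3/5 = 0.60
confidence = freq("A", "B") / freq("A")  # 3/4 = 0.75
lift = support / ((freq("A") / N) * (freq("B") / N))  # 0.60 / (0.8 * 0.8) ≈ 0.94
print(support, confidence, lift)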

Apriori Algorithm

Confusion Matrix
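
A minimal sketch with scikit-learn: rows are true classes, columns are predicted classes (the labels below are made up).

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[TN FP]
#  [FN TP]]  for binary 0/1 labels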
