Beginning Machine Learning For Developers

Anup Singh

6 min read · Jan 31, 2022

Anaconda Navigator

Installation - Anaconda documentation

Review the system requirements listed below before installing Anaconda Individual Edition. If you don't want the…

docs.anaconda.com

Key Elements of Machine Learning

Representation
Evaluation (accuracy, precision and recall, probability)
Optimization

Inductive learning is where we are given examples of a function as input data (x) together with the corresponding output (f(x)). The goal is to learn the function well enough to apply it to new data (x).

Classification: when the function being learned is discrete.
Regression: when the function being learned is continuous.
Probability Estimation: when the output of the function is a probability.

Python Libraries

matplotlib [ Charts ]
Seaborn [ Heat maps, charts ]
Scikit Learn [ Data mining, Data analysis ]
Pandas [ Data wrangling, data manipulation, aggregation and visualization ]
Numpy [ Numerical python, n-arrays and matrices, mathematical purposes ]

Types of ML

Supervised ( Training data set )
Reinforced ( Reward, learn from consequences of action )
Unsupervised (Unlabeled data -> Clusters)

Life cycle

Collecting Data -> Data Wrangling -> Analyse Data -> Train Algorithm -> Test Algorithm -> Deployment

Data wrangling -> filtering, cleaning, feature engineering (https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10)
Analyze -> Create models
Training -> Check Accuracy
Deployment -> Operation and Optimization
Retrain -> based upon accuracy
Prediction

Supervised Learning

Linear Regression [ y = a + bX]
Logistic Regression ( Categorical )
Decision Tree ( Classification -> Categorical/Continuous Dependent)
Random Forest (Ensemble of decision trees; usually gives better prediction accuracy than a single decision tree)
Naive Bayes Classifier (Bayes’ Theorem with an assumption of independence between predictors)

Linear Regression

Dependent variable, Independent Variable -> Relations
Correlation -> Gives the mutual relation between variables -> Plot using seaborn
Regression Line -> The line that best fits the data, with minimum mean squared error (MSE)
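The regression line y = a + bX can be fitted by ordinary least squares. The sketch below is a minimal illustration, assuming made-up data; `fit_line` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

def fit_line(x, y):
    """Return intercept a and slope b that minimize the mean squared error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Slope: covariance of x and y divided by the variance of x
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # Intercept: the fitted line passes through the point of means
    a = y.mean() - b * x.mean()
    return a, b

# Made-up data that lies exactly on y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

a, b = fit_line(x, y)
mse = np.mean((np.asarray(y) - (a + b * np.asarray(x))) ** 2)
print(a, b, mse)
```

On real data the points will not sit exactly on a line, so the MSE will be greater than zero.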

Example

Pricing of real estate
Make predictions for new values of the independent variable

Model Fitting

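Underfit -> Model is too simple; fits neither the training data nor new data well
Appropriate -> Captures the trend and generalizes to new data
Overfit -> Fits the training data well but generalizes poorly to new data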
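One way to see the three cases is to fit polynomials of increasing degree to the same noisy data and compare training and test error. This is a sketch on synthetic data; the degrees, noise level, and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth curve (synthetic data, for illustration only)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0.03, 0.97, 50)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    # Store (training error, test error) for each degree
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, results[degree])
```

Training error can only go down as the degree grows. A large gap between training and test error is the usual sign of overfitting, while high error on both suggests underfitting. The exact numbers depend on the seed.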

Logistic Regression

Analysing the dataset in which there are one or more independent variables that determine an outcome.

Outcome -> Discrete

S-curve (sigmoid) -> Output is a probability between 0 and 1, thresholded to a class of 0 or 1

How does the probability of getting lung cancer (yes vs. no) change for every additional pound a person is overweight and for every pack of cigarettes smoked per day?

Do body weight, calorie intake, fat intake, and age have an influence on the probability of having a heart attack (yes vs. no)?

https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/what-is-logistic-regression/
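The S-curve behind logistic regression is the sigmoid function applied to a linear combination of the inputs. Here is a minimal sketch with one input; the intercept and slope are hypothetical values chosen for illustration, not fitted from any data.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, intercept, slope):
    """Probability of the positive outcome for input x."""
    return sigmoid(intercept + slope * x)

# Hypothetical coefficients (assumed, not estimated from real data)
intercept, slope = -4.0, 0.8

for x in (0, 5, 10):
    p = predict_probability(x, intercept, slope)
    label = "yes" if p >= 0.5 else "no"
    print(x, round(p, 3), label)
```

The 0.5 threshold is a common default. It can be moved when false positives and false negatives have different costs.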

Decision Tree

Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain parameter.

Leaves => decisions or the final outcomes.
Decision nodes are where the data is split.

Classification trees (Yes/No types)
Regression trees (Continuous data types)

| Xoriant

Introduction Decision Trees are a type of Supervised Machine Learning (that is you explain what the input is and what…

www.xoriant.com

Machine Learning Decision Tree Classification Algorithm - Javatpoint

Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but…

www.javatpoint.com

Predict whether a person is fit given their information like age, eating habit, and physical activity, etc

Consider a piece of data collected over the course of 14 days where the features are Outlook, Temperature, Humidity, Wind and the outcome variable is whether Golf was played on the day. Now, our job is to build a predictive model which takes in above 4 parameters and predicts whether Golf will be played on the day. We’ll build a decision tree to do that using ID3 algorithm.
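ID3 chooses each split by picking the feature with the highest information gain, which is the reduction in entropy of the outcome. The sketch below computes both for a single feature. The 14 rows are illustrative values in the style of the golf example, since the actual table is not shown here; `entropy` and `information_gain` are helper names made up for this sketch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Illustrative data: 14 days, Outlook feature and whether golf was played
outlook = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5
played = ["No", "No", "No", "Yes", "Yes"] + ["Yes"] * 4 + ["Yes", "Yes", "Yes", "No", "No"]

base_entropy = entropy(played)
gain_outlook = information_gain(outlook, played)
print(round(base_entropy, 3), round(gain_outlook, 3))
```

ID3 would repeat this calculation for Temperature, Humidity, and Wind, split on the feature with the largest gain, and then recurse on each branch.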

Logistic Regression and Decision Tree Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset and take a first look
data = pd.read_csv("data.csv", header=0)
print(data.head(6))
data.info()

# Drop columns that are not used as features
data.drop("Unnamed: 32", axis=1, inplace=True)
data.drop("id", axis=1, inplace=True)
print(data.columns)

# Encode the target labels as integers: M -> 1, B -> 0
data["diagnosis"] = data["diagnosis"].map({"M": 1, "B": 0})
print(data.describe())

# Class balance and feature correlations
sns.countplot(x=data["diagnosis"])
plt.show()
corr = data.corr()
sns.heatmap(corr)  # further heatmap arguments are elided in the original
plt.show()

prediction_var = ["texture_mean", "perimeter_mean", "smoothness_mean", "compactness_mean", "symmetry_mean"]

# Split into training and test sets, then separate features and target
train, test = train_test_split(data, test_size=0.3)
train_X = train[prediction_var]
train_y = train.diagnosis
test_X = test[prediction_var]
test_y = test.diagnosis

# Logistic regression
logistic = LogisticRegression()
logistic.fit(train_X, train_y)
temp = logistic.predict(test_X)
print(metrics.accuracy_score(test_y, temp))

# Decision tree, with 10-fold cross-validation on the training set
clf = DecisionTreeClassifier(random_state=0)
print(cross_val_score(clf, train_X, train_y, cv=10))
clf.fit(train_X, train_y)
print(clf.get_params(deep=True))
print(clf.predict(test_X))
print(clf.score(test_X, test_y))

Random Forest

Random Forest is a classifier that trains a number of decision trees on various subsets of the given dataset and combines their predictions (majority vote for classification, average for regression) to improve predictive accuracy.

Ensemble classifier made using many Decision tree models

Combines the result from different models.
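A quick way to compare a single tree with a forest is to train both on the same split and look at test accuracy. This sketch uses a synthetic dataset from `make_classification` as a stand-in for real data; the sample sizes and `n_estimators=100` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# One decision tree versus an ensemble of 100 trees
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print("tree:", tree_acc, "forest:", forest_acc)
```

The forest often scores higher because averaging many trees reduces variance, but that is not guaranteed on every dataset.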

Machine Learning Random Forest Algorithm - Javatpoint

Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used…

www.javatpoint.com

How to Choose a Machine Learning Model ?

How to Choose a Machine Learning Model - Some Guidelines - DataScienceCentral.com

In this post, we explore some broad guidelines for selecting machine learning models The overall steps for Machine…

www.datasciencecentral.com

When to use Random Forest over SVM and vice versa?

I would say, the choice depends very much on what data you have and what is your purpose. A few "rules of…

datascience.stackexchange.com

Multiclass (or multinomial) classification is the problem of classifying instances into one of three or more classes.

Naive Bayes Theorem

Classification Technique

P(H|E) = P(E|H).P(H) / P(E)
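To make the formula concrete, here is a small worked example for spam filtering, where H is "the email is spam" and E is "the email contains a given word". All probabilities are hypothetical numbers picked for illustration.

```python
def posterior(p_e_given_h, p_h, p_e):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return p_e_given_h * p_h / p_e

# Hypothetical probabilities (assumed for illustration)
p_h = 0.2               # P(H): prior probability that an email is spam
p_e_given_h = 0.6       # P(E|H): the word appears in a spam email
p_e_given_not_h = 0.05  # P(E|not H): the word appears in a non-spam email

# Total probability of seeing the word in any email
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

p_h_given_e = posterior(p_e_given_h, p_h, p_e)
print(round(p_h_given_e, 3))
```

With these numbers, seeing the word raises the spam probability from the prior of 0.2 to 0.75.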

News Categorization

Spam Filtering

Weather Predictions

Run or walk (date, time, username, wrist, gyro_x, gyro_y, gyro_z)

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report
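A minimal sketch of how these pieces fit together. The run-or-walk sensor data is not included here, so a synthetic dataset from `make_classification` with six features stands in for the wrist and gyro columns (an assumption made only for this example).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the sensor features (not the real run-or-walk data)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Gaussian Naive Bayes assumes each feature is normally distributed within a class
model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

In the confusion matrix, rows are the true classes and columns are the predicted classes.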

Scikit Learn

import seaborn as sns
from sklearn.linear_model import LinearRegression

# df is assumed to be a pandas DataFrame with "price" and "room_num" columns
y = df["price"]
x = df[["room_num"]]

ln = LinearRegression()
ln.fit(x, y)
print(ln.intercept_, ln.coef_)

help(ln)
ln.predict(x)

sns.jointplot(x="room_num", y="price", data=df, kind="reg")

Logistic Regression
train test split
metrics
Decision tree classifier -> Test accuracy and model
Random Forest

Data Preprocessing

Univariate Analysis

Mean
Median
Mode
Quartiles
Standard Deviation
Variance

EDD (Extended Data Dictionary) => Presence of outliers -> df.describe()

Is the distribution skewed? Does it have outliers?

Plot to see outliers (jointplot, countplot)

Observations from EDD

Missing values
Skewness or outliers
variables with no significance

Outlier Treatment

Capping and Flooring
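Capping and flooring replaces extreme values with a chosen upper and lower limit instead of dropping the rows. A minimal sketch with NumPy follows; the percentile cut-offs and sample values are assumptions for illustration, and `cap_and_floor` is a hypothetical helper name.

```python
import numpy as np

def cap_and_floor(values, lower_pct=1, upper_pct=99):
    """Clip values below the lower percentile and above the upper percentile."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, low, high)

# Made-up sample with one very high and one very low outlier
data = np.array([10, 12, 11, 13, 12, 11, 500, 12, 10, -300], dtype=float)

treated = cap_and_floor(data, lower_pct=10, upper_pct=90)
print(treated)
```

The right cut-offs depend on the variable. Some treatments use a multiple of a percentile rather than the percentile itself.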

Seasonality in Data

Normalize with a seasonal factor: m = mean(year) / mean(month)
Adjusted value: d = d(p) * m

Bivariate Analysis

Scatter plot [ keep, discard and transform variables ]
Correlation [ X1, X2 ]

Variable Transformation

Mean of multiple variables
Ratio variables
Log, exponential to make it linear

Dummy variable -> Categorical value

pd.get_dummies(df)
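`get_dummies` is a pandas top-level function that turns each categorical value into its own indicator column. A small sketch follows; the column names and values are made up for illustration.

```python
import pandas as pd

# Made-up data with one numeric and one categorical column
df = pd.DataFrame({"price": [100, 150, 120], "airport": ["YES", "NO", "YES"]})

# Replace the categorical column with one indicator column per category
encoded = pd.get_dummies(df, columns=["airport"])
print(encoded.columns.tolist())
print(encoded)
```

For regression models it is common to pass `drop_first=True`, so that one category acts as the baseline and the dummy columns are not perfectly collinear.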

Correlation vs Causation

Classification Models

Logistic Regression

The dependent variable must be categorical in nature.

The independent variables should not be multicollinear.

Logistic Regression in Machine Learning - Javatpoint

Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning…

www.javatpoint.com

Linear Discriminant Analysis
K Nearest Neighbour (works for regression as well as classification, but is mostly used for classification problems)
SVM (Support Vector Machine) (Classification as well as Regression)

Support Vector Machine (SVM) Algorithm - Javatpoint

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for…

www.javatpoint.com

K-Nearest Neighbor(KNN) Algorithm for Machine Learning - Javatpoint

K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning technique. K-NN…

www.javatpoint.com

Example: a car manufacturer has launched a new SUV and wants to show ads to the users who are likely to be interested in buying it.

Logistic Regression

Whether the house will be sold within 3 months or not.

from sklearn.linear_model import LogisticRegression

# df and sns are assumed to be defined as in the earlier examples
x = df[["price"]]
y = df["Sold"]

clf_lrs = LogisticRegression()
clf_lrs.fit(x, y)
print(clf_lrs.intercept_, clf_lrs.coef_)

help(clf_lrs)
clf_lrs.predict(x)

sns.jointplot(x="room_num", y="price", data=df, kind="reg")

Unsupervised Learning

Distribute the data into clusters consisting of similar data points.

Intrinsic grouping in a set of unlabelled data

Marketing

Discovering distinct groups in customer data set
Customers who use the internet rather than calls
Risky customers (insurance, loan)

Clustering

Distance from centroid

Euclidean distance
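The core step of centroid-based clustering is to assign each point to its nearest centroid by Euclidean distance. A minimal NumPy sketch, with made-up points and centroids:

```python
import numpy as np

# Made-up 2-D points and two candidate centroids
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 1.5], [8.5, 9.0]])

# distances[i, j] is the Euclidean distance from point i to centroid j
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point joins the cluster of its closest centroid
assignments = distances.argmin(axis=1)
print(assignments)
```

K-Means alternates this assignment step with recomputing each centroid as the mean of its assigned points, until the assignments stop changing.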

How to decide the number of Clusters

Elbow method (SSE)
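For the elbow method, fit K-Means for a range of k and record the sum of squared errors (SSE), which scikit-learn exposes as `inertia_`. This sketch uses synthetic blobs as a stand-in for real data; the range of k and the seed are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

sse = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of points to their closest centroid
    sse.append(model.inertia_)

print([round(v, 1) for v in sse])
```

Plotting `sse` against k typically shows a sharp drop followed by a flat tail. The bend, or elbow, is a reasonable choice for the number of clusters.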

Types of Clustering

Exclusive Clustering (Item belong to only one cluster) (K-Means Clustering)
Overlapping Clustering (Item can belong to multiple clusters) (C-Means Clustering)
Hierarchical Clustering ( Cluster with parent child relationships)

matplot -> ggplot
from sklearn.cluster import KMeans

C-Means Clustering

Soft clustering

Hierarchical Clustering

Once a decision is made to combine two clusters, it can’t be undone

Too slow for large datasets

Market Basket Analysis

Association Rule Mining

Technique that shows how items are associated with each other.
Laptop => Laptop Bags

Support, Confidence, Lift

Support: s(A, B) = freq(A, B) / N (fraction of all N transactions that contain both A and B)
Confidence: c(A => B) = freq(A, B) / freq(A) (how often B occurs in the transactions that contain A)
Lift: l(A => B) = s(A, B) / (s(A) * s(B))
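The three measures can be computed directly from a list of transactions. The sketch below uses a made-up set of five baskets; `support`, `confidence`, and `lift` are helper names written for this example.

```python
# Made-up transactions, each a set of purchased items
transactions = [
    {"laptop", "laptop bag"},
    {"laptop", "laptop bag", "mouse"},
    {"laptop"},
    {"mouse"},
    {"laptop bag"},
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """How often b appears in the transactions that contain a."""
    return support({a, b}) / support({a})

def lift(a, b):
    """Support of a and b together relative to what independence would predict."""
    return support({a, b}) / (support({a}) * support({b}))

s = support({"laptop", "laptop bag"})
c = confidence("laptop", "laptop bag")
l = lift("laptop", "laptop bag")
print(round(s, 3), round(c, 3), round(l, 3))
```

A lift above 1 suggests the two items are bought together more often than chance alone would predict, and a lift below 1 suggests the opposite.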

A, B, C, D, E

Apriori Algorithm

Confusion Matrix
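A confusion matrix counts true positives, false positives, false negatives, and true negatives, from which accuracy, precision, and recall follow. Here is a minimal sketch in plain Python with made-up labels; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

# Made-up true labels and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
print(tp, fp, fn, tn, accuracy, precision, recall)
```

Accuracy alone can be misleading when one class is rare, which is why precision and recall are usually reported alongside it.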

Written by Anup Singh