Beginning Machine Learning For Developers
Anaconda Navigator
Key Elements of Machine Learning
- Representation
- Evaluation (accuracy, precision and recall, probability)
- Optimization
Inductive Learning is where we are given examples of a function in the form of data (x) and the output of the function (f(x)). The goal of inductive learning is to learn the function for new data (x).
- Classification: when the function being learned is discrete.
- Regression: when the function being learned is continuous.
- Probability Estimation: when the output of the function is a probability.
Python Libraries
- matplotlib [ Charts ]
- Seaborn [ Heat maps, charts ]
- Scikit Learn [ Data mining, Data analysis ]
- Pandas [ Data wrangling, data manipulation, aggregation and visualization ]
- Numpy [ Numerical python, n-arrays and matrices, mathematical purposes ]
Types of ML
- Supervised ( Training data set )
- Reinforcement ( Reward, learn from consequences of action )
- Unsupervised (Unlabeled data -> Clusters)
Life cycle
Collecting Data -> Data Wrangling -> Analyse Data -> Train Algorithm -> Test Algorithm -> Deployment
- Data wrangling -> filtering, cleaning, feature engineering (https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10)
- Analyze -> Create models
- Training -> Check Accuracy
- Deployment -> Operation and Optimization
- Retrain -> based upon accuracy
- Prediction
Supervised Learning
- Linear Regression [ y = a + bX]
- Logistic Regression ( Categorical )
- Decision Tree ( Classification -> Categorical/Continuous Dependent)
- Random Forest (Ensemble of decision tree, gives better prediction accuracy than Decision Tree)
- Naive Bayes Classifier (Bayes’ Theorem with an assumption of independence between predictors)
Linear Regression
- Dependent variable, Independent Variable -> Relations
- Correlation -> Gives mutual relation -> Plot using seaborn
- Regression Line -> Best fits the data with minimum mean squared error (MSE)
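A minimal sketch of how the regression line y = a + bX can be found by least squares. The data and the helper names (fit_line, mse) are made up for illustration; in practice a library such as scikit-learn does this for you.

```python
# Ordinary least squares for a single feature: y = a + b * x.

def fit_line(xs, ys):
    """Return (a, b) that minimize the mean squared error of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def mse(xs, ys, a, b):
    """Mean squared error of the line y = a + b * x on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: the dependent variable is exactly 1 + 2 * rooms.
rooms = [1, 2, 3, 4, 5]
price = [3, 5, 7, 9, 11]

a, b = fit_line(rooms, price)
print("intercept:", a, "slope:", b)
print("MSE:", mse(rooms, price, a, b))
```

Because the toy data lies exactly on a line, the fitted line recovers it and the error is zero; real data will leave a non-zero MSE.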
Example
- Pricing of real estate
- Make predictions for next independent variable
Model Fitting
- Underfit -> Model is too simple and fits neither the training data nor new data well
- Appropriate
- Overfit -> Fits the training data closely but does not generalize to new data
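One way to see underfitting and overfitting is to fit polynomials of increasing degree and compare training and test error. This is a rough sketch on synthetic data; the quadratic trend, the noise level, and the degrees 1, 2, and 9 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a quadratic trend plus noise (assumed, not from a real dataset).
x_train = np.linspace(-3, 3, 20)
y_train = x_train ** 2 + rng.normal(0, 1.0, size=x_train.shape)
x_test = np.linspace(-2.9, 2.9, 20)
y_test = x_test ** 2 + rng.normal(0, 1.0, size=x_test.shape)


def train_test_mse(degree):
    """Fit a polynomial of the given degree and return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return float(train_err), float(test_err)


errors = {degree: train_test_mse(degree) for degree in (1, 2, 9)}
for degree, (train_err, test_err) in errors.items():
    print(f"degree={degree} train_mse={train_err:.3f} test_mse={test_err:.3f}")
```

The straight line (degree 1) is expected to do badly on both sets. The degree 9 fit always has training error at least as low as degree 2, and its test error is typically no better, which is the overfitting pattern.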
Logistic Regression
Analysing the dataset in which there are one or more independent variables that determine an outcome.
- Outcome -> Discrete
S-Curve (0 or 1)
How does the probability of getting lung cancer (yes vs. no) change for every additional pound a person is overweight and for every pack of cigarettes smoked per day?
Do body weight, calorie intake, fat intake, and age have an influence on the probability of having a heart attack (yes vs. no)?
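The S-curve is the logistic (sigmoid) function, which turns a weighted sum of the predictors into a probability between 0 and 1. A small sketch follows; the intercept and coefficient are hypothetical values, not estimates from any study.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model with one predictor: packs of cigarettes smoked per day.
intercept = -3.0
coef_packs = 1.5

for packs in (0, 1, 2, 3):
    probability = sigmoid(intercept + coef_packs * packs)
    label = 1 if probability >= 0.5 else 0  # threshold at 0.5 for a yes/no outcome
    print(packs, round(probability, 3), label)
```

Each additional pack shifts the input of the sigmoid by the coefficient, so the probability rises along the S-curve instead of along a straight line.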
Decision Tree
Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain parameter.
- Leaves => decisions or the final outcomes.
- Decision nodes are where the data is split.
- Classification trees (Yes/No types)
- Regression trees (Continuous data types)
Predict whether a person is fit given their information like age, eating habit, and physical activity, etc
Consider a dataset collected over the course of 14 days where the features are Outlook, Temperature, Humidity, and Wind, and the outcome variable is whether Golf was played on that day. Our job is to build a predictive model that takes in the above 4 parameters and predicts whether Golf will be played on a given day. We’ll build a decision tree to do that using the ID3 algorithm.
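ID3 chooses the split at each node by picking the feature with the highest information gain, that is, the largest drop in entropy. The sketch below shows that calculation only. The six rows are hypothetical and are not the 14-day dataset described above; the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical rows for illustration only.
days = [
    {"Outlook": "Sunny", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Overcast", "Wind": "Strong", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Strong", "Play": "No"},
]

gains = {f: information_gain(days, f, "Play") for f in ("Outlook", "Wind")}
best = max(gains, key=gains.get)
print(gains)
print("split on:", best)
```

A full ID3 implementation would repeat this choice recursively on each subset until the leaves are pure or no features remain.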
Logistic Regression and Decision Tree
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# Load and inspect the data
data = pd.read_csv("data.csv", header=0)
print(data.head(6))
data.info()

# Drop unused columns, then encode the target (M = 1, B = 0)
data.drop("Unnamed: 32", axis=1, inplace=True)
data.columns
data.drop("id", axis=1, inplace=True)
data["diagnosis"] = data["diagnosis"].map({"M": 1, "B": 0})
data.describe()

# Class balance and feature correlations
sns.countplot(x=data["diagnosis"], label="count")
corr = data.corr()
sns.heatmap(corr, .....)

# Select features and split into training and test sets
prediction_var = ["texture_mean", "perimeter_mean", "smoothness_mean", "compactness_mean", "symmetry_mean"]
train, test = train_test_split(data, test_size=0.3)
train_X = train[prediction_var]
train_y = train.diagnosis
test_X = test[prediction_var]
test_y = test.diagnosis

# Logistic regression
logistic = LogisticRegression()
logistic.fit(train_X, train_y)
temp = logistic.predict(test_X)
metrics.accuracy_score(test_y, temp)

# Decision tree, with 10-fold cross-validation on the training set
clf = DecisionTreeClassifier(random_state=0)
cross_val_score(clf, train_X, train_y, cv=10)
clf.fit(train_X, train_y)
clf.get_params(deep=True)
clf.predict(test_X)
clf.score(test_X, test_y)
Random Forest
Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset.
Ensemble classifier made using many Decision tree models
Combines the result from different models.
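A toy sketch of how an ensemble combines results by majority vote. The three "trees" here are hand-written rules with made-up thresholds and feature names. A real random forest instead trains each tree on a random sample of the rows and features.

```python
from collections import Counter


# Three hypothetical "trees", each deciding on a different feature.
def tree_age(person):
    return "fit" if person["age"] < 40 else "unfit"


def tree_activity(person):
    return "fit" if person["exercise_hours"] >= 3 else "unfit"


def tree_diet(person):
    return "unfit" if person["eats_junk_food"] else "fit"


FOREST = [tree_age, tree_activity, tree_diet]


def forest_predict(person):
    """Return the class predicted by the majority of the trees."""
    votes = Counter(tree(person) for tree in FOREST)
    return votes.most_common(1)[0][0]


person = {"age": 45, "exercise_hours": 5, "eats_junk_food": False}
print(forest_predict(person))
```

One tree votes "unfit" because of age, but the other two outvote it, which is how an ensemble can smooth over the mistakes of a single model.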
How to Choose a Machine Learning Model?
Multiclass or multinomial classification is the problem of classifying instances into one of three or more classes.
Naive Bayes Theorem
Classification Technique
P(H|E) = P(E|H).P(H) / P(E)
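A worked example of the formula, taking H = "message is spam" and E = "message contains the word offer". All three input probabilities are hypothetical, and P(E) is expanded with the law of total probability.

```python
def posterior(p_e_given_h, p_h, p_e_given_not_h):
    """P(H|E) = P(E|H) * P(H) / P(E), with P(E) from the law of total probability."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / p_e


# Hypothetical numbers, chosen only to illustrate the arithmetic.
p_spam = 0.2               # P(H): prior probability that a message is spam
p_offer_given_spam = 0.6   # P(E|H)
p_offer_given_ham = 0.05   # P(E|not H)

p_spam_given_offer = posterior(p_offer_given_spam, p_spam, p_offer_given_ham)
print(round(p_spam_given_offer, 2))
```

Here P(E) = 0.6 * 0.2 + 0.05 * 0.8 = 0.16, so P(H|E) = 0.12 / 0.16 = 0.75: seeing the word raises the spam probability from 0.2 to 0.75.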
News Categorization
Spam Filtering
Weather Predictions
Run or walk (date, time, username, wrist, gyro_x, gyro_y, gyro_z)
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report
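A rough sketch of how these imports fit together. The run-or-walk data is not available here, so the block generates synthetic three-column readings as a stand-in for gyro_x, gyro_y, gyro_z; the class means and sample sizes are assumptions.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in for sensor readings: 0 = walk, 1 = run.
n = 200
walk = rng.normal(loc=0.0, scale=1.0, size=(n, 3))
run = rng.normal(loc=3.0, scale=1.0, size=(n, 3))
X = np.vstack([walk, run])
y = np.array([0] * n + [1] * n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows = actual class, columns = predicted class
print(cm)
print(classification_report(y_test, y_pred, target_names=["walk", "run"]))
```

GaussianNB assumes each feature is normally distributed within a class and independent of the others, which matches how the synthetic data was generated; real sensor data may fit that assumption less well.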
Scikit Learn
from sklearn.linear_model import LinearRegression
y = df["price"]
x = df[["room_num"]]
ln = LinearRegression()
ln.fit(x, y)
print(ln.intercept_, ln.coef_)
help(ln)
ln.predict(x)
sns.jointplot(x="room_num", y="price", data=df, kind="reg")
- Logistic Regression
- train test split
- metrics
- Decision tree classifier -> Test accuracy and model
- Random Forest
Data Preprocessing
Univariate Analysis
- Mean
- Median
- Mode
- Quartiles
- Standard Deviation
- Variance
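These summary statistics can be computed with Python's standard statistics module. The values below are hypothetical, with one deliberately large value to show how an outlier pulls the mean away from the median.

```python
import statistics

# Hypothetical values; 95 is a deliberate outlier.
prices = [12, 15, 15, 18, 20, 22, 95]

mean = statistics.mean(prices)
median = statistics.median(prices)
mode = statistics.mode(prices)
quartiles = statistics.quantiles(prices, n=4)  # three cut points: Q1, Q2, Q3
std_dev = statistics.stdev(prices)             # sample standard deviation
variance = statistics.variance(prices)         # sample variance

print("mean:", round(mean, 2))
print("median:", median)
print("mode:", mode)
print("quartiles:", quartiles)
print("std dev:", round(std_dev, 2))
print("variance:", round(variance, 2))
```

A mean well above the median, as here, is a quick hint that the variable is right-skewed or contains outliers.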
EDD => Presence of outliers -> df.describe()
skewed ?? has outliers ??
Plot to see outliers (Jointplot, countplot)
Observations from EDD
- Missing values
- Skewness or outliers
- variables with no significance
Outlier Treatment
- Capping and Flooring
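A minimal sketch of capping and flooring with NumPy: values above an upper percentile are pulled down to it, and values below a lower percentile are pulled up. The data and the 10th/90th percentile cut-offs are assumptions; other cut-offs are common too.

```python
import numpy as np

# Hypothetical values with one very low and one very high outlier.
values = np.array([2, 48, 50, 51, 52, 53, 55, 56, 58, 400], dtype=float)

# Assumed cut-offs: floor at the 10th percentile, cap at the 90th.
floor, cap = np.percentile(values, [10, 90])

# np.clip replaces anything below floor with floor and anything above cap with cap.
treated = np.clip(values, floor, cap)

print("floor:", floor, "cap:", cap)
print(treated)
```

The values in the middle are left untouched, so the median does not change while the extremes stop dominating the mean and variance.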
Seasonality in Data
- Normalize (m = u(year) / u(month))
- d = d(p) * m
Bivariate Analysis
- Scatter plot [ keep, discard and transform variables ]
- Correlation [ X1, X2 ]
Variable Transformation
- Mean of multiple variables
- Ratio variables
- Log, exponential to make it linear
Dummy variable -> Categorical value
pd.get_dummies(df)
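A small sketch of dummy-variable creation with pandas. The DataFrame and its "airport" column are hypothetical. Passing drop_first=True is one common choice that keeps n - 1 dummies per category so the dummies are not perfectly collinear.

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame(
    {
        "price": [100, 150, 120],
        "airport": ["YES", "NO", "YES"],
    }
)

# Replace the categorical column with indicator columns, dropping the first level ("NO").
encoded = pd.get_dummies(df, columns=["airport"], drop_first=True)
print(encoded)
```

The result keeps the numeric column as it is and adds a single airport_YES indicator, which is enough to represent a two-level category.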
Correlation vs Causation
Classification Models
- Logistic Regression
The dependent variable must be categorical in nature.
The independent variables should not be multicollinear.
- Linear Discriminant Analysis
- K Nearest Neighbour (works for both Regression and Classification, but is mostly used for Classification problems)
- SVM (Support Vector Machine) (Classification as well as Regression)
A car manufacturer has launched a new SUV and wants to show ads to the users who are likely to be interested in buying it.
Logistic Regression
Whether the house will be sold within 3 months or not.
from sklearn.linear_model import LogisticRegression
x = df[["price"]]
y = df["Sold"]
clf_lrs = LogisticRegression()
clf_lrs.fit(x, y)
print(clf_lrs.intercept_, clf_lrs.coef_)
help(clf_lrs)
clf_lrs.predict(x)
Unsupervised Learning
Distribute the data into clusters consisting of similar data points.
Intrinsic grouping in a set of unlabelled data
Marketing
- Discovering distinct groups in customer data set
- Customers who use the internet rather than calls
- Risky customers (insurance, loan)
Clustering
Distance from centroid
Euclidean distance
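A short sketch of the distance-from-centroid idea: compute the Euclidean distance from a point to each centroid and assign the point to the closest one. The centroid coordinates and names are hypothetical.

```python
import math

# Hypothetical cluster centroids in two dimensions.
centroids = {"A": (1.0, 1.0), "B": (8.0, 8.0)}


def nearest_centroid(point):
    """Return the name of the centroid with the smallest Euclidean distance to point."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))


print(math.dist((0, 0), (3, 4)))      # 3-4-5 right triangle, so the distance is 5.0
print(nearest_centroid((2.0, 3.0)))
print(nearest_centroid((7.0, 9.0)))
```

K-Means repeats this assignment step for every point, then moves each centroid to the mean of its assigned points, until the assignments stop changing.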
How to decide the number of Clusters
- Elbow method (SSE)
Types of Clustering
- Exclusive Clustering (Item belong to only one cluster) (K-Means Clustering)
- Overlapping Clustering (Item can belong to multiple clusters) (C-Means Clustering)
- Hierarchical Clustering ( Cluster with parent child relationships)
matplot -> ggplot
from sklearn.cluster import KMeans
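A sketch of the elbow method with KMeans: fit the model for several values of k and look at the SSE, which scikit-learn exposes as inertia_. The three synthetic blobs and the range of k are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic data: three well-separated blobs of 50 points each.
centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# SSE (sum of squared distances to the nearest centroid) for each k.
sse = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_

for k, value in sse.items():
    print(k, round(value, 1))
```

SSE always falls as k grows, so the useful signal is where the drop flattens out. With this data the large drops should end at k = 3, which is the elbow.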
C-Means Clustering
Soft clustering
Hierarchical Clustering
Once a decision is made to combine two clusters, it can’t be undone
Too slow for large datasets
Market Basket Analysis
Association Rule Mining
- Technique that shows how items are associated with each other.
- Laptop => Laptop Bags
Support, Confidence, Lift
- s(A, B) = freq(A, B) / N (fraction of the N transactions that contain both A and B)
- c(A => B) = freq(A, B) / freq(A) (how often B occurs in the transactions that contain A)
- l(A => B) = s(A, B) / (s(A) * s(B))
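The three measures can be computed directly from a list of transactions. This is a sketch with five hypothetical baskets; the function names are illustrative and not taken from any library.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"laptop", "laptop bag", "mouse"},
    {"laptop", "laptop bag"},
    {"laptop", "mouse"},
    {"mouse"},
    {"laptop bag"},
]


def support(*items):
    """Fraction of transactions that contain all of the given items."""
    wanted = set(items)
    return sum(wanted <= basket for basket in transactions) / len(transactions)


def confidence(a, b):
    """Of the transactions that contain a, the fraction that also contain b."""
    return support(a, b) / support(a)


def lift(a, b):
    """Ratio of the observed joint support to the support expected under independence."""
    return support(a, b) / (support(a) * support(b))


print("support:", support("laptop", "laptop bag"))
print("confidence:", round(confidence("laptop", "laptop bag"), 3))
print("lift:", round(lift("laptop", "laptop bag"), 3))
```

A lift above 1 suggests the two items are bought together more often than chance would predict, a lift near 1 suggests no association, and a lift below 1 suggests they tend not to be bought together.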
A, B, C, D, E
Apriori Algorithm
Confusion Matrix
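A confusion matrix counts how predictions line up with the actual labels: true positives, false positives, false negatives, and true negatives. A plain-Python sketch for the binary case follows; the labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical actual and predicted labels.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print("tp, fp, fn, tn:", tp, fp, fn, tn)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

Accuracy, precision, and recall all come from these four counts, which is why the confusion matrix is usually the first thing to inspect when evaluating a classifier.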