Data Science | Statistics | Machine Learning

Some basic statistical concepts for data scientists and how to use them

Maths and statistics are powerful tools in data science. They are essential because together they form the foundation of all machine learning algorithms, so to succeed as a data scientist you must know the basics.

Statistics is the use of maths to perform technical analysis on the data to gain meaningful insights. With statistics, we can operate on the data in an information-driven and targeted manner.

So, how is data science different from statistics? While the fields are closely related in the sense that both data scientists and statisticians aim to…

Machine Learning | Neural Networks | Activation functions

Popular activation functions and their use

An activation function is a function inside a neuron that converts an input signal into an output signal.

Basically, a neuron calculates the weighted sum of its inputs, adds the bias, and then passes the result to the activation function, which decides whether the neuron should produce an output or not. Activation functions give the neural network its non-linear properties. Without an activation function, a neuron's output is an unbounded linear combination of its inputs and can range from -infinity to +infinity.
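The steps described above (weighted sum, bias, activation) can be sketched in a few lines of Python. This is only an illustration: the input values, weights, and bias are made up, and sigmoid, ReLU, and tanh are chosen here as common examples rather than taken from the article.

```python
import math

def weighted_sum(inputs, weights, bias):
    """Pre-activation value: dot product of inputs and weights, plus the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def sigmoid(z):
    """Squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive values through unchanged and maps the rest to 0."""
    return max(0.0, z)

def tanh_activation(z):
    """Squashes any real number into the open interval (-1, 1)."""
    return math.tanh(z)

# Hypothetical values, chosen only for illustration.
inputs = [0.5, -1.2, 3.0]
weights = [0.8, 0.1, -0.4]
bias = 0.2

z = weighted_sum(inputs, weights, bias)  # unbounded pre-activation value
print("pre-activation:", z)
print("sigmoid:", sigmoid(z))
print("relu:", relu(z))
print("tanh:", tanh_activation(z))
```

The pre-activation value `z` can be any real number, while each activation maps it into its own fixed range (ReLU is bounded below only).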

We are all aware of feature scaling and why it’s done. Feature scaling is performed during data pre-processing and is done to normalize/standardize…

Polynomial Regression | Data Science | Machine Learning

Determining the best degree of polynomial to choose in a polynomial regression.

In this article we will learn what the Bayesian Information Criterion (BIC) is and how it is used to choose the degree of a polynomial in a polynomial regression.

Sometimes R² values differ only slightly between two polynomial degrees, e.g. an R² score of 88.3% versus 88.4%. And how do we know which is better: R² = 88% or R² = 90%?

Let’s study this by creating some dummy data:

Let’s fit the model with Ordinary Least Squares (OLS). This package provides a detailed statistical summary that includes AIC, BIC, etc.
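The original data and the package used are not shown in this excerpt, so the following is only a minimal sketch of the idea using the Python standard library. The dummy data (a noisy quadratic), the helper names `fit_polynomial` and `bic`, and the choice to count only the polynomial coefficients as parameters are all assumptions made for illustration. The BIC is computed as k·ln(n) − 2·ln(L) under Gaussian errors, and the degree with the lowest BIC is preferred.

```python
import math
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) c = X^T y."""
    k = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution; coef[p] multiplies x**p.
    coef = [0.0] * k
    for i in reversed(range(k)):
        known = sum(A[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (b[i] - known) / A[i][i]
    return coef

def bic(xs, ys, coef):
    """BIC = k*ln(n) - 2*ln(L), with the Gaussian log-likelihood of an OLS fit."""
    n, k = len(xs), len(coef)  # assumption: k counts the coefficients only
    rss = sum((y - sum(c * x ** p for p, c in enumerate(coef))) ** 2
              for x, y in zip(xs, ys))
    log_likelihood = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical dummy data: a quadratic trend plus Gaussian noise.
random.seed(0)
xs = [i / 10 for i in range(-30, 31)]
ys = [1 + 2 * x - 0.5 * x ** 2 + random.gauss(0, 0.5) for x in xs]

bic_by_degree = {}
for degree in range(1, 6):
    bic_by_degree[degree] = bic(xs, ys, fit_polynomial(xs, ys, degree))

best_degree = min(bic_by_degree, key=bic_by_degree.get)
for degree, value in bic_by_degree.items():
    print(f"degree {degree}: BIC = {value:.2f}")
print("lowest BIC at degree", best_degree)
```

Unlike R², which never decreases when a higher degree is added, the k·ln(n) term penalizes every extra coefficient, so a tiny gain in fit does not automatically justify a more complex polynomial.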

Checklist pdf | Data Preparation | Exploratory Data Analysis | Data Mining

A reference checklist for Data Analytics professionals

If you fail to plan, you plan to fail. Every project requires planning, and building a machine learning model is no different. In this article, we will learn how to plan your data mining activities and which steps you should perform during Exploratory Data Analysis (EDA). This article is not a ‘how-to’ guide but a reference checklist for data analytics professionals. It will provide you with a list of considerations when building a machine learning model.

We have all heard about CRISP-DM: Cross Industry Standard Process for Data Mining. …


Choosing the right hyperparameter values using Cross-Validation

Simple linear regression suffers from two major flaws:

  1. It’s prone to overfitting when there are many input features, and
  2. It cannot easily express non-linear/curvy relationships.

One way to tackle these issues is to increase the model complexity, for example by using decision trees or polynomial regression to represent non-linear relationships.

However, these more complex models are themselves prone to overfitting. Therefore, in order to represent non-linear functions without overfitting, we make use of regularization techniques.

Regularization techniques calibrate linear and non-linear regression models by adding a penalty term to the loss function. Minimizing this adjusted loss function discourages overly large coefficients and helps prevent overfitting.
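As a rough illustration of an adjusted loss function and of picking its hyperparameter by cross-validation, the sketch below uses ridge (L2) regularization on a single feature. Ridge is chosen here as an assumption, since the excerpt cuts off before naming the techniques; the data, the candidate penalty values, and the helper names `ridge_fit_1d` and `cross_val_mse` are likewise hypothetical.

```python
import random

def ridge_fit_1d(xs, ys, lam):
    """Minimize sum((y - w*x - b)^2) + lam * w^2; only the slope w is penalized."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / (sxx + lam)  # lam = 0 gives the ordinary least-squares slope
    return w, mean_y - w * mean_x

def cross_val_mse(xs, ys, lam, k=5):
    """Mean squared error on held-out points, averaged over k interleaved folds."""
    n = len(xs)
    squared_errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        w, b = ridge_fit_1d(train_x, train_y, lam)
        squared_errors += [(ys[i] - (w * xs[i] + b)) ** 2 for i in held_out]
    return sum(squared_errors) / n

# Hypothetical data: a linear trend plus Gaussian noise.
random.seed(1)
xs = [random.uniform(-2, 2) for _ in range(40)]
ys = [3 * x + 1 + random.gauss(0, 1) for x in xs]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]  # penalty strengths to compare
scores = {lam: cross_val_mse(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)

for lam in candidates:
    w, b = ridge_fit_1d(xs, ys, lam)
    print(f"lambda={lam:>6}: slope={w:.3f}, CV MSE={scores[lam]:.3f}")
print("selected lambda:", best_lam)
```

A larger penalty shrinks the slope toward zero. Cross-validation scores each candidate on data the model was not fitted on, which is what makes it a reasonable way to choose the penalty strength.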

The two…

AI replacing jobs | Jobs AI will replace | Jobs and AI

How to augment machine intelligence and what the future will look like for humans in terms of jobs

Image created by author

Advances in smart assistants like Alexa and Google have brought remarkable convenience into our day-to-day lives, e.g. getting a quick weather report, translating languages, or listening to world news; today you can even send virtual hugs to your Alexa contacts. With recent Artificial Intelligence (AI) breakthroughs like AlphaGo, IBM Watson, self-driving cars, and many more, the concern that AI will take over our jobs is real.

Can you imagine the impact of these applications on humans as they advance? Eventually, everything would be done for you by an AI. Now, the question is, what value would you be adding…

Statistics | z score table | z score formula

Understanding how z-scores were invented and how they are used

In this article we will find answers to the following questions:

  1. What is a z-score: formula and definition.
  2. How to use a z-score, shown with a toy example.
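For orientation, the statistical z-score (standard score) of a value x is z = (x − μ) / σ, where μ is the mean and σ the standard deviation. The toy data below is made up for illustration, and the use of the population standard deviation is an assumption; the article's own example is not shown in this excerpt.

```python
import statistics

def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std

# Hypothetical exam scores, used only as a toy example.
scores = [55, 60, 65, 70, 75, 80, 85]
mu = statistics.mean(scores)
sigma = statistics.pstdev(scores)  # population standard deviation (assumption)

z_scores = [z_score(s, mu, sigma) for s in scores]
for s, z in zip(scores, z_scores):
    print(f"score {s}: z = {z:+.2f}")
```

A score equal to the mean has a z-score of 0, and the sign shows on which side of the mean a value falls.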

History: The letter ‘Z’ in z-score is said to stand for Zeta (6th letter of the Greek alphabet), which comes from the Zeta Model that was originally developed by Edward Altman to estimate the chances of a public company going bankrupt. Note that the Altman Z-score is a credit-risk measure and a different quantity from the statistical z-score (standard score). Altman Z-scores fall into zones that indicate the likelihood of a public company going bankrupt.

  • z < 1.81 - Distress “Zone”
  • 1.81 < z < 2.99 - Grey “Zone”
  • z > 2.99 - Safe “Zone”

Data preparation | Machine Learning | Data Science

Feature scaling in Python. Understanding why feature scaling is required and the two common types of feature scaling methods

Normalization vs Standardization

In this article we will discover answers to the following questions:

  1. What is feature scaling and why is it required in Machine Learning (ML)?
  2. Normalization — pros and cons.
  3. Standardization — pros and cons.
  4. Normalization or standardization: which one is better?

Image created by author
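As a quick preview of the two methods named above, here is a minimal sketch using only the Python standard library. It assumes that normalization means min-max scaling to [0, 1] and standardization means rescaling to zero mean and unit standard deviation, which are the usual definitions; the feature values are made up for illustration.

```python
import statistics

def normalize(values):
    """Min-max scaling: maps the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi != lo

def standardize(values):
    """Z-score scaling: the result has mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumes the column is not constant
    return [(v - mu) / sigma for v in values]

# Hypothetical features on very different scales.
ages = [20, 30, 40, 50, 60]
incomes = [20000, 35000, 50000, 65000, 80000]

print("normalized ages:     ", normalize(ages))
print("normalized incomes:  ", normalize(incomes))
print("standardized ages:   ", [round(v, 3) for v in standardize(ages)])
print("standardized incomes:", [round(v, 3) for v in standardize(incomes)])
```

After either transformation the two features end up on comparable scales, so neither dominates purely because of its units.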

First things first, let’s use an analogy to understand why we need feature scaling. Think of building an ML model as making a smoothie, and this time you are making a strawberry-banana smoothie. You have to mix the strawberries and bananas carefully to make the smoothie taste good. If you just mix one

Data preparation | One hot encoding | Data encoding

Understanding why machines require categorical data encoding

If you are reading this, I am assuming you already know what encoding means. Nevertheless, I’ll give a brief intro for those who are new to data science.

Note: Throughout this article, the terms features, columns and variables are used interchangeably.

Data is classified as below:

Image created by author

The What and Why

1. What is Categorical Encoding and why do you need it?

For some Machine Learning algorithms, whenever you have categorical data, you have to convert it to numerical type. The reason you convert categorical columns to numerical is so that the machine learning algorithm is able to understand and process it. …
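One common way to do this conversion is one-hot encoding, which this article's tags mention. The sketch below is a minimal standard-library version; the `colors` column and the helper name `one_hot_encode` are hypothetical, and sorting the categories alphabetically is just one possible convention.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, so the algorithm receives numbers without any artificial ordering being imposed on the categories.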

Machine Learning | Data Science

What is data science | What is Machine Learning

This article will help you understand what Machine Learning (ML) is, why it is called Machine Learning, and how it resembles the way living creatures learn and evolve from birth. Let’s see below:

Machine Learning is the art of teaching computers, or letting them learn, patterns from data. This is often done because it helps in making informed decisions and sometimes accurate predictions. I promise that’s all the definition you’ll have in this article.

Always remember, if you can hit upon an analogy for what the unknown concept is like, you are halfway there. The rest is to practice explaining it to a 6-year-old. So let’s take an easy example to understand ML:

  1. Learning the pattern
Image created by author

Let’s say a baby panda while playing in the forest sees a burning bright…

Swapnil Kangralkar

Data Scientist and Project Management Professional at Government of Canada.
