V2 Data Science

I picked up the notes below from various online courses I am taking, and sometimes added notes of my own.

  1. What is the difference between the population and the sample? What is the difference between a parameter and a statistic?

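The population is the universal set, and a parameter is a number that describes it. A sample is a subset of the population, and a statistic is a number that describes the sample.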

In statistics, a population is the entire pool from which a statistical sample is drawn. Examples of populations are all newborn babies in North America and all tech startups in Asia.

A parameter is any summary number, like an average or percentage, that describes the entire population.

Since we usually can't measure the entire population, we usually can't compute its parameters exactly either.

For example, the average height of adult women in the United States is a parameter that has an exact value; we just don't know what it is. Other examples are the average height of all CFA exam candidates in the world, the mean weight of U.S. taxpayers, and so on.

What we can collect is a sample: a specific subset drawn from the population. Summary numbers computed from the sample, such as the sample mean or the sample standard deviation, are referred to as statistics, and they are used to estimate the corresponding population parameters.

[Figure: population and sample]
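To make the parameter/statistic distinction concrete, here is a minimal sketch. The "population" is simulated (the mean of 162 cm and spread of 7 cm are made-up values, not real data), so that the parameter can actually be computed and compared with a statistic from one random sample.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: 100,000 simulated heights in cm (illustration only).
population = rng.normal(loc=162.0, scale=7.0, size=100_000)

# Parameter: a summary number computed from the entire population.
population_mean = population.mean()

# Statistic: the same kind of summary, computed from a sample of 200.
sample = rng.choice(population, size=200, replace=False)
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # ddof=1 gives the usual sample standard deviation

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
print(f"sample std (statistic):      {sample_std:.2f}")
```

The sample mean lands close to the population mean but is not identical to it, and it would change if a different sample were drawn.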

  1. How to set the value of k for binom.pmf() and binom.cdf() functions?

If we want to calculate the probability that the random variable X is exactly equal to x, then binom.pmf(k=x,...) will be used.

If we want to calculate the probability that the random variable X is less than or equal to x, then binom.cdf(k=x,...) will be used.

If we want to calculate the probability that the random variable X is greater than or equal to x, then 1-binom.cdf(k=x-1,...) will be used.
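The three cases above can be sketched as follows. The values n = 10, p = 0.5 and x = 3 are assumed purely for illustration (ten fair coin tosses).

```python
from scipy.stats import binom

n, p = 10, 0.5  # assumed example: 10 fair coin tosses
x = 3

p_exactly = binom.pmf(k=x, n=n, p=p)           # P(X = 3)
p_at_most = binom.cdf(k=x, n=n, p=p)           # P(X <= 3)
p_at_least = 1 - binom.cdf(k=x - 1, n=n, p=p)  # P(X >= 3) = 1 - P(X <= 2)

print(f"P(X = 3)  = {p_exactly:.4f}")   # 0.1172
print(f"P(X <= 3) = {p_at_most:.4f}")   # 0.1719
print(f"P(X >= 3) = {p_at_least:.4f}")  # 0.9453
```

Note the k = x - 1 in the last case: because X is discrete, P(X >= x) is the complement of P(X <= x - 1), not of P(X <= x).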

  1. What is the empirical rule of a normal distribution?

The empirical rule states that about 68% of the observations of a normal distribution fall within one standard deviation of the mean (µ ± σ), about 95% within two standard deviations of the mean (µ ± 2σ), and about 99.7% within three standard deviations of the mean (µ ± 3σ).

Let’s assume pizza delivery timings in a restaurant are known to be normally distributed.

µ(mean delivery time): 30 minutes

σ(standard deviation) : 5 minutes

Using the Empirical Rule, we can determine that,

68% of the delivery times are between 25-35 minutes (30 ± 5)

95% of the delivery times are between 20-40 minutes (30 ± 2x5)

99.7% of the delivery times are between 15-45 minutes (30 ±3x5)
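The percentages in the pizza example can be checked numerically with the normal CDF. This is a small sketch using the same µ = 30 and σ = 5 as above.

```python
from scipy.stats import norm

mu, sigma = 30, 5  # mean and standard deviation of delivery time, in minutes

within = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    # P(low <= X <= high) is the difference of two CDF values.
    within[k] = norm.cdf(high, loc=mu, scale=sigma) - norm.cdf(low, loc=mu, scale=sigma)
    print(f"{low}-{high} minutes: {within[k]:.4f}")

# 25-35 minutes: 0.6827
# 20-40 minutes: 0.9545
# 15-45 minutes: 0.9973
```

The exact values are slightly different from 68%, 95% and 99.7%, which is why the rule is best read as an approximation.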

  1. What are degrees of freedom?

The degrees of freedom of an estimate is the number of independent pieces of information that went into calculating the estimate. Let's understand this with an example. Suppose there are 10 observations, and we know their sum. If 9 of the 10 observations are known to us, the 10th can be calculated by subtracting the sum of those 9 from the total. So once 9 observations are known, the remaining one carries no new information, and we say that the degrees of freedom are 10 - 1 = 9.

The same reasoning applies to other estimates: every quantity that is fixed or estimated from the data first (such as the sample mean) uses up one degree of freedom. So for n observations with one such constraint, the degrees of freedom are k = n - 1.
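A common place where n - 1 shows up is the sample variance. The sketch below uses five made-up observations to show that the deviations from the sample mean always sum to zero, so the last deviation is fixed by the other n - 1.

```python
import numpy as np

x = np.array([4.0, 7.0, 1.0, 9.0, 4.0])  # made-up observations
n = len(x)

deviations = x - x.mean()

# The deviations sum to zero, so the last one can be recovered from the rest.
last_from_others = -deviations[:-1].sum()

# Sample variance divides by the degrees of freedom, n - 1.
sample_var = (deviations ** 2).sum() / (n - 1)

print(last_from_others, deviations[-1])  # both are -1.0
print(sample_var, np.var(x, ddof=1))     # both are 9.5
```

In NumPy, the `ddof` argument ("delta degrees of freedom") controls this divisor; `ddof=1` divides by n - 1.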

  1. What is PMF (Probability Mass Function), PDF (Probability Density Function), CDF (Cumulative Distribution Function), and PPF (Percent-Point Function) in simple terms?

Suppose that X is a discrete random variable. Now, when we want to calculate mass probabilities, like P(X=x), we use the PMF function. This is the probability that X takes the value x. An example of discrete distribution is Binomial Distribution.

The PDF is the analogue of the PMF for continuous distributions. An example of a continuous distribution is the Normal Distribution. If X is a continuous random variable, P(X=x) is zero for any single value x, so the PDF gives the density at x rather than a probability. Probabilities over an interval, such as P(a<=X<=b), are obtained as the area under the PDF between a and b.

The CDF function helps us calculate the cumulative probability P(X<=x), which is the probability that X takes the value less than or equal to x. This is the cumulative distribution function and is applicable in both discrete and continuous cases.

The PPF function is the inverse of the CDF function.

Suppose P(X<=x) = alpha, where we are given the alpha (probability) value and want to find the corresponding value of x. The PPF returns that x. This function is applicable in both discrete and continuous cases; in the discrete case it returns the smallest x for which P(X<=x) is at least alpha.
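Here is a short sketch of the four functions side by side, using scipy.stats. The distributions chosen (Binomial with n = 10, p = 0.5, and the standard normal) are assumptions for illustration.

```python
from scipy.stats import binom, norm

# Discrete case: X ~ Binomial(n=10, p=0.5)
p_eq_4 = binom.pmf(4, 10, 0.5)    # PMF: P(X = 4)
p_le_4 = binom.cdf(4, 10, 0.5)    # CDF: P(X <= 4)
median_x = binom.ppf(0.5, 10, 0.5)  # PPF: smallest x with P(X <= x) >= 0.5

# Continuous case: Z ~ Normal(0, 1)
density_at_0 = norm.pdf(0)              # PDF: a density, not a probability
p_between = norm.cdf(1) - norm.cdf(-1)  # P(-1 <= Z <= 1) from the CDF
z_975 = norm.ppf(0.975)                 # PPF: z such that P(Z <= z) = 0.975
round_trip = norm.cdf(z_975)            # CDF undoes PPF, giving back 0.975

print(p_eq_4, p_le_4, median_x)
print(density_at_0, p_between, z_975, round_trip)
```

For the continuous case, calling the CDF on the PPF result returns the original probability, which is what "inverse" means here.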

  1. What are the functions used in Statistical Analysis from the Scipy Stats library to calculate probabilities in a normal distribution?

pdf(x, loc=0, scale=1) - Probability density function.

cdf(x, loc=0, scale=1) - Cumulative distribution function.

ppf(q, loc=0, scale=1) - Percent point function (inverse of cdf - percentiles).

Please refer to the official SciPy documentation for scipy.stats.norm for details.

  1. What is a z-score and how is it used in real-life scenarios?

A z-score (also called the standard score) measures how many standard deviations below or above the mean a data point lies: z = (x - µ) / σ. The z-score is very useful because it lets us compare two scores coming from two different normal populations. The two scores might be on different scales, but we can still compare them using their z-scores.

Real-life application: Suppose you have taken two different competitive exams with different scoring systems. How will you compare your scores in the two exams? Let's assume that the exams are popular and that their scores follow two different normal distributions. To compare your scores (coming from two different normal populations), you need to standardize each of them. Then you can easily compare them.
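The exam comparison can be sketched as below. The exam names, means, standard deviations and scores are all hypothetical numbers chosen for illustration.

```python
from scipy.stats import norm


def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std


# Hypothetical exams with different scales (made-up numbers).
z_exam_a = z_score(x=82, mean=70, std=8)      # exam A, scored out of 100
z_exam_b = z_score(x=650, mean=500, std=120)  # exam B, scored out of 800

# Under the normality assumption, the CDF turns a z-score into a percentile.
pct_a = norm.cdf(z_exam_a)
pct_b = norm.cdf(z_exam_b)

print(f"exam A: z = {z_exam_a:.2f}, percentile = {pct_a:.1%}")  # z = 1.50
print(f"exam B: z = {z_exam_b:.2f}, percentile = {pct_b:.1%}")  # z = 1.25
```

Even though 650 is the larger raw number, the score of 82 on exam A is the better relative performance because its z-score is higher.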

Types of machine learning algorithms

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning
  • Recommender systems

Prasanna Kulkarni (https://www.prasannakulkarni.com)

Linear Algebra

Matrix (m*n)

is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns.

Vector (m*1)

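is a one-dimensional array of numbers, symbols, or expressions. It has only one row or one column.

You can add, subtract, and multiply vectors and matrices, as long as their shapes are compatible. There is no direct matrix division; for an invertible square matrix, multiplying by its inverse plays that role.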
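A minimal NumPy sketch of these operations, with small made-up matrices. Instead of dividing by a matrix, the example solves the linear system A x = v, which is the usual stand-in for "division" in linear algebra.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])  # 2x2 matrix
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])  # 2x2 matrix
v = np.array([[1.0],
              [2.0]])       # 2x1 column vector

total = A + B        # element-wise addition
diff = A - B         # element-wise subtraction
product = A @ B      # matrix product: [[2, 1], [4, 3]]
Av = A @ v           # matrix-vector product: [[5], [11]]

# "Division": find x such that A @ x = v, without forming the inverse explicitly.
x = np.linalg.solve(A, v)  # [[0.0], [0.5]]

print(product)
print(Av)
print(x)
```

Note that `A * B` in NumPy is element-wise multiplication, which is different from the matrix product `A @ B`.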


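The determinant measures how much areas or volumes are scaled by a linear transformation. It is a scalar value that can be computed from the elements of a square matrix and encodes certain properties of the linear transformation described by the matrix. For example, a determinant of zero means the transformation collapses space onto a lower dimension, so the matrix is not invertible.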
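The volume-scaling reading of the determinant can be checked on a few simple 2x2 matrices. The matrices below are made up for illustration.

```python
import numpy as np

# Stretch x by 2 and y by 3: the unit square becomes a 2x3 rectangle.
S = np.array([[2.0, 0.0],
              [0.0, 3.0]])
det_S = np.linalg.det(S)  # area scales by 6

# Rotate by 90 degrees: shapes move but their area is unchanged.
R = np.array([[0.0, -1.0],
              [1.0, 0.0]])
det_R = np.linalg.det(R)  # area scales by 1

# Second row is twice the first: the plane is squashed onto a line.
P = np.array([[1.0, 2.0],
              [2.0, 4.0]])
det_P = np.linalg.det(P)  # area scales by 0 (up to floating-point error)

print(det_S, det_R, det_P)
```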

Eigenvector

is a nonzero vector that does not change direction during a linear transformation. It is only scaled (and flipped, if the scaling factor is negative).

Eigenvalue

is the scalar factor that describes the scaling of an eigenvector during a linear transformation.

Rank of a matrix

is the number of linearly independent rows (equivalently, columns) of a matrix.
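A small sketch with NumPy that checks the defining property A v = λ v for each eigenpair, and computes the rank of two matrices. The matrices are made-up examples.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Each column of `eigenvectors` is an eigenvector; applying A only rescales it.
checks = []
for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    vec = eigenvectors[:, i]
    checks.append(np.allclose(A @ vec, lam * vec))

rank_A = np.linalg.matrix_rank(A)  # rows are independent: rank 2

# Second row is twice the first, so only one row is independent: rank 1.
P = np.array([[1.0, 2.0],
              [2.0, 4.0]])
rank_P = np.linalg.matrix_rank(P)

print(sorted(eigenvalues))  # eigenvalues of A are 1 and 3
print(checks, rank_A, rank_P)
```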

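Covariance matrix

It helps us understand how far each random variable is spread out from its mean and how pairs of variables vary together. The diagonal entries are the variances of the individual variables, and the off-diagonal entries are the covariances between pairs of variables. Note: The function numpy.cov() in Python's NumPy library can be used to get the covariance matrix.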
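A minimal sketch of numpy.cov() on two made-up variables. By default it treats each input as one variable and divides by n - 1.

```python
import numpy as np

# Made-up data: 5 observations of two variables.
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight = np.array([50.0, 58.0, 69.0, 80.0, 93.0])

cov_matrix = np.cov(height, weight)  # 2x2 covariance matrix

var_height = cov_matrix[0, 0]          # variance of height
var_weight = cov_matrix[1, 1]          # variance of weight
cov_height_weight = cov_matrix[0, 1]   # covariance; equals cov_matrix[1, 0]

print(cov_matrix)
```

The positive off-diagonal value reflects that, in this toy data, weight increases together with height.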

Dimensionality reduction

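We often deal with high-dimensional data, with lots of columns or "features" that represent the information collected about each observation. Dimensionality reduction techniques reduce the number of features while trying to keep as much of the useful structure in the data as possible.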

Principal component analysis (PCA)

is a dimensionality reduction technique that can be used to reduce the number of features in a dataset while retaining as much information as possible.
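One way to see how PCA works is to build it from the covariance matrix and its eigenvectors. The following is a minimal sketch with NumPy on synthetic data (three features, one of which is nearly a copy of another); it is an illustration of the idea, not how any particular library implements PCA.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: 200 observations, 3 features; x3 is almost a copy of x1.
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Step 1: center the data and compute the covariance matrix of the features.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)  # shape (3, 3)

# Step 2: eigen-decomposition (eigh is meant for symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]   # sort by variance explained, descending
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 3: keep the top 2 directions and project the data onto them.
n_components = 2
components = eigenvectors[:, :n_components]
X_reduced = X_centered @ components     # shape (200, 2)

explained_ratio = eigenvalues[:n_components].sum() / eigenvalues.sum()
print(X_reduced.shape, f"variance retained: {explained_ratio:.3f}")
```

Because the third feature is nearly redundant, two components retain almost all of the variance in this toy dataset.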


t-SNE (t-distributed Stochastic Neighbor Embedding)

is a dimensionality reduction technique that is particularly well suited for the visualization of high-dimensional datasets. It converts similarities between data points to joint probabilities and tries to minimize the Kullback–Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.

KL divergence

is a measure of how one probability distribution is different from a second, reference probability distribution. It is often used in loss functions for training generative models, such as variational autoencoders (VAEs). Two important points to note about KL divergence:

  • The lower the KL divergence value, the better the two distributions match. If two distributions match perfectly, it is zero.
  • It is not symmetric: the divergence of P from Q is generally not equal to the divergence of Q from P.
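Both points about KL divergence can be checked on a small example. The helper `kl_divergence` below is a hypothetical function written for this note, and the two discrete distributions are made up; it assumes all probabilities are strictly positive.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(P || Q) = sum of p * log(p / q), for strictly positive discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))


p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

kl_pq = kl_divergence(p, q)  # divergence of P from Q
kl_qp = kl_divergence(q, p)  # divergence of Q from P (a different number)
kl_pp = kl_divergence(p, p)  # identical distributions give zero

print(kl_pq, kl_qp, kl_pp)
```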

The Binomial/Bernoulli Distribution

A Bernoulli trial (or a Binomial trial) is a random experiment with exactly 2 possible outcomes, “success” or “failure”, in which the probability of success is the same every time the experiment is conducted.

  • p = Probability of Success
  • q = 1 - p = Probability of Failure
  • n = Number of Trials
  • x = Number of successes desired

The notation for a binomial distribution is X ~ B(n, p).

The Bernoulli Distribution is a special case of the Binomial Distribution where the number of trials is equal to 1. Hence it represents the probability of an event occurring in 1 single trial. For example, the probability of getting heads when a coin is tossed only 1 time is 0.5.
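With the symbols defined above, the binomial probability of exactly x successes is C(n, x) · p^x · q^(n - x); this standard formula is not stated in the notes, so treat it as an addition. The sketch below compares it with scipy.stats and checks the Bernoulli special case, using assumed values n = 5, p = 0.3, x = 2.

```python
from math import comb

from scipy.stats import bernoulli, binom

n, p = 5, 0.3  # assumed example values
q = 1 - p
x = 2

# Binomial probability from the formula and from SciPy.
formula_value = comb(n, x) * p**x * q**(n - x)
scipy_value = binom.pmf(x, n, p)

# Bernoulli is Binomial with n = 1: probability of heads in a single fair toss.
heads_bernoulli = bernoulli.pmf(1, 0.5)
heads_binomial = binom.pmf(1, 1, 0.5)

print(formula_value, scipy_value)       # both approximately 0.3087
print(heads_bernoulli, heads_binomial)  # both 0.5
```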


Hypothesis Testing

This excerpt from a book takes a more detailed look at the foundations of hypothesis testing. This is a good summary document covering the foundational concepts of hypothesis testing and statistical inference.


Further reading on PCA and t-SNE:

  • An end-to-end comprehensive guide for PCA.
  • A video on PCA by StatQuest.
  • An article that explains Principal Component Analysis with an example.
  • An article that shows how to use t-SNE.
  • A good summary of the differences between PCA and t-SNE.
  • A resource with an excellent Q&A on t-SNE.



Scikit-learn is an open-source Python library built upon NumPy, SciPy, and Matplotlib, and it is widely used in machine learning projects. It focuses on machine learning tools, including mathematical, statistical, and general-purpose algorithms that form the basis for many machine learning technologies, and it provides functionality for dimensionality reduction, feature selection, feature extraction, ensemble techniques, and built-in datasets. Scikit-learn is a top choice of academic and industrial organizations for a variety of tasks because of its effectiveness and adaptability.

The essential features of scikit-learn that simplify machine learning include:

  • Unsupervised Learning Algorithms: This group of algorithms includes unsupervised neural networks, principal component analysis, cluster analysis, and factor analysis.
  • Feature Extraction: Scikit-learn allows you to extract features from both text and images.
  • Dimensionality Reduction: With the help of this feature, the number of attributes in the data can be reduced for later feature selection, visualization, and summarization.
  • Clustering: This feature allows the grouping of unlabeled data.
  • Algorithms for Supervised Learning: There is a very high chance that any supervised machine learning algorithm you have heard of is included in the scikit-learn library. These include generalized linear models (such as linear regression), decision trees, support vector machines, and Bayesian techniques.
  • Cross-validation: Scikit-learn can be used to test the accuracy and validity of supervised models using unseen data.
  • Ensemble methods: The predictions of various supervised models can be combined by using this feature.
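A few of these features can be sketched together in a handful of lines. The example below is only an illustration: the choice of the built-in iris dataset, logistic regression, and 5-fold cross-validation is an assumption made for this note, not something taken from the course material.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Built-in dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Dimensionality reduction: project the 4 features onto 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X)

# Supervised learning with cross-validation: 5-fold accuracy of a classifier.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)

print(X_2d.shape)
print(f"mean cross-validated accuracy: {scores.mean():.3f}")
```

Each fold trains on part of the data and scores on the held-out part, so the mean score is an estimate of how the model would do on unseen data.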


  • https://www.sagepub.com/sites/default/files/upm-binaries/40007_Chapter8.pdf
  • https://onlinestatbook.com/2/logic_of_hypothesis_testing/logic_hypothesis.pdf
  • https://www.analyticsvidhya.com/blog/2020/12/an-end-to-end-comprehensive-guide-for-pca/
  • https://www.youtube.com/watch?v=FgakZw6K1QQ
  • https://www.analyticssteps.com/blogs/introduction-principal-component-analysis-machine-learning
  • https://distill.pub/2016/misread-tsne/
  • https://www.geeksforgeeks.org/difference-between-pca-vs-t-sne/
  • https://lvdmaaten.github.io/tsne/
  • https://www.youtube.com/watch?v=NEaUSP4YerM