Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.
Types of statistics
1) Descriptive statistics
Descriptive statistics is about understanding, analyzing, and summarizing data in the form of numbers and graphs. We analyze the data using different plots and charts suited to different kinds of data (numerical and categorical), such as bar plots, pie charts, scatter plots, histograms, etc. All kinds of interpretation and visualization are part of descriptive statistics. It is important to remember that descriptive statistics can be performed on sample or population data, but in practice we almost never have access to the full population data.
2) Inferential statistics
We extract a sample from the population data, and from that sample we infer something (draw conclusions) about the population. A sample of data is used as the basis for making a conclusion about the whole population. This can be achieved through various techniques for sampling, visualizing, and manipulating the data.
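As a minimal sketch of this idea (with invented salary numbers, purely for illustration), we can draw a random sample from a large array standing in for the population and use the sample mean to estimate the population mean:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "population" of 100,000 salaries (values invented for illustration)
population = rng.normal(loc=50_000, scale=10_000, size=100_000)

# Draw a small sample and infer the population mean from it
sample = rng.choice(population, size=500, replace=False)
print("sample mean:     ", sample.mean())
print("population mean: ", population.mean())  # normally unknown in practice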
Types of Data
1) Numerical Data –
Numerical data simply means numbers. Numerical data is further divided into two categories: discrete and continuous numerical variables.
I) Discrete Numerical Variables – Discrete variables are those that take a finite (countable) set of values, such as a rank in the classroom, the number of professors in a department, etc.
II) Continuous Numerical Variables – Continuous variables are those whose values can fall anywhere within a range, so there are infinitely many possible values, for example the salary of an employee.
2) Categorical Data –
Categorical data is string or character type data, such as a name or a color. Generally, these are also of two types.
I) Ordinal Variables – Ordinal categorical variables are those whose values can be ranked, such as the grade of a student (A, B, C) or high/medium/low.
II) Nominal Variables – These variables have no inherent ranking; they simply contain the names of categories, like a color name, a subject, etc.
Measures of Central Tendency
A measure of central tendency gives an idea of the centrality of the data, i.e., where the center of your data lies. The common measures are the mean, median, and mode.
I) Mean –
The mean of a numeric variable is simply the average of all its values. When the data contains outliers, using the mean in any kind of manipulation is not recommended, because even a single outlier can distort it badly. The usual remedy is the median.
II) Median –
The median is the center value after sorting all the numbers. If the total count is even, it is the average of the two center values. It is not affected by outliers unless more than half of the data consists of outliers.
III) Mode –
The mode of a numeric variable represents the most frequent observation. NumPy does not provide a function for finding the mode, but SciPy does.
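As a small sketch (with an invented list of numbers), NumPy gives us the mean and median directly and SciPy's stats.mode gives the mode; notice how the single large value drags the mean but barely moves the median:

import numpy as np
from scipy import stats

data = np.array([2, 5, 1, 6, 7, 3, 11, 5, 200])  # 200 acts as an outlier

print(np.mean(data))          # pulled up sharply by the outlier
print(np.median(data))        # barely affected by the outlier
print(stats.mode(data).mode)  # most frequent value (5 here)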
Never rely on just one of these three measures; try to use all three so that you can understand the nature of the data.
Measures of Spread
Measures of spread help you understand how the data is spread out, i.e., where most of the data lies and how far it extends (toward the positive side, the negative side, or around the center).
I) Range –
The range is simply the maximum minus the minimum of your data (max - min).
II) Percentiles –
Statistics uses percentiles to represent the value below which a given percentage of observations in a group falls. For example, the 20th percentile is the value below which 20 percent of the data falls. Percentiles come up a lot in real-world scenarios such as exams: if the 20th percentile of exam marks is 35, then 20 percent of the observations are below 35.
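A quick sketch with invented exam marks; np.percentile returns the value below which the given percentage of observations falls:

import numpy as np

marks = np.array([35, 42, 51, 58, 63, 67, 71, 78, 85, 93])  # invented marks

# Value below which roughly 20 percent of the marks fall
print(np.percentile(marks, 20))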
III) Quartiles –
Quartiles are the values that divide a sorted list of numbers into quarters. The steps to find the quartiles are:
- Organize the list of numbers in order
- Then cut the list into 4 equal parts
- The quartiles will appear at the cuts
The median is also known as Q2. The quartiles correspond to the 25th, 50th, 75th, and 100th percentiles.
IV) Interquartile Range (IQR) – The Interquartile Range (IQR) is a measure of dispersion: the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Many calculations and data preprocessing steps, such as outlier handling, use this quantity.
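A small sketch of the quartiles and the IQR on an invented list of numbers; the 1.5 * IQR fences shown here are a common rule of thumb for flagging outliers:

import numpy as np

data = np.array([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49])  # invented values

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Common rule of thumb: anything outside these fences is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)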
V) Mean Absolute Deviation –
It describes the variation in the data in terms of absolute deviations from the mean; this is known as the mean absolute deviation (MAD). In simple words, it is the average absolute distance of each point from the mean.
MAD = ( Σ |xi − x̄| ) / n
E.g., set of numbers = [2, 5, 1, 6, 7, 3, 11]
The mean is (2 + 5 + 1 + 6 + 7 + 3 + 11) / 7 = 35 / 7 = 5.
Find the absolute differences: |2 - 5| + |5 - 5| + |1 - 5| + |6 - 5| + |7 - 5| + |3 - 5| + |11 - 5|
= 3 + 0 + 4 + 1 + 2 + 2 + 6 = 18, and MAD = 18 / 7 ≈ 2.57.
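We can verify this small calculation with NumPy (a sketch using the same numbers):

import numpy as np

numbers = np.array([2, 5, 1, 6, 7, 3, 11])
mad = np.mean(np.abs(numbers - numbers.mean()))  # average absolute distance from the mean
print(mad)  # 18 / 7 ≈ 2.57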
VI) Variance –
The variance measures how far the data points are, on average, from the mean; the only difference from MAD is that instead of taking the absolute value of each deviation, we square it. To compute the variance, find the difference between each data point and the mean, square these differences, sum them up, and take the average. You can compute variance directly with NumPy.
The problem with variance
The problem with variance is that, because of the squaring, it is not in the same unit of measurement as the original data. Because this makes it less intuitive, most people prefer the standard deviation.
VII) Standard Deviation – The standard deviation is the square root of the variance; taking the square root undoes the squaring, so the standard deviation is back in the same unit as the original data. With NumPy we can calculate both directly.
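A minimal sketch with invented salary figures; note that np.var and np.std compute the population versions by default (pass ddof=1 for the sample versions):

import numpy as np

salaries = np.array([30, 32, 35, 38, 40, 45, 90])  # invented, in thousands

print(np.var(salaries))  # in squared units (thousands squared), hard to interpret
print(np.std(salaries))  # back in the original units (thousands)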
VIII) Median Absolute Deviation – The median absolute deviation is the median of the absolute deviations of each observation from the median of the data. NumPy does not have a built-in function for it, but statsmodels has a robust module that contains a MAD function.
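A small sketch of both ways of getting it; note that statsmodels' robust.mad rescales the raw value by roughly 1.4826 by default so that it estimates the standard deviation of normally distributed data:

import numpy as np
from statsmodels import robust

data = np.array([2, 5, 1, 6, 7, 3, 11])

# Plain median absolute deviation: median of |x - median(x)|
print(np.median(np.abs(data - np.median(data))))

# statsmodels' version, rescaled for consistency with the normal distribution
print(robust.mad(data))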
Normal Distribution
A normal distribution is a distribution in the form of a bell curve. Many datasets in machine learning approximately follow a normal distribution, and if they do not, we often try to transform them into one, because many machine learning algorithms work very well on this distribution. Many real-world variables behave this way; for example, with salaries, very few employees have a very low salary, very few have a very high salary, and most employees lie somewhere in the middle range.
If data follows a normal (Gaussian) distribution, it satisfies three conditions known as the empirical rule:
P[mean - std_dev <= X <= mean + std_dev] ≈ 68 percent
P[mean - 2*std_dev <= X <= mean + 2*std_dev] ≈ 95 percent
P[mean - 3*std_dev <= X <= mean + 3*std_dev] ≈ 99.7 percent
You analyze this while performing exploratory data analysis. We can also convert a variable's distribution into the standard normal distribution by standardizing it (subtracting the mean and dividing by the standard deviation).
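A quick sanity check of the empirical rule on synthetic normally distributed data (the exact fractions will vary slightly from run to run):

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0, scale=1, size=100_000)  # synthetic normal data

mean, std = x.mean(), x.std()
for k in (1, 2, 3):
    within = np.mean((x >= mean - k * std) & (x <= mean + k * std))
    print(f"within {k} std dev: {within:.3f}")  # roughly 0.683, 0.954, 0.997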
Skewness
Skewness is a measure of the asymmetry of a distribution, which you can see when you plot it as a histogram or KDE: the peak sits towards the mode of the data. Skewness is generally of two types, left-skewed and right-skewed. Some also count a third type, the symmetric distribution, i.e., a normal-like distribution.
I) Right-Skewed Data (Positively skewed distribution)
A right-skewed distribution means the data has a long tail towards the right side (positive axis). A classic example of right-skewed data is wealth distribution: very few people have very high wealth, and most people are in the medium range. I would encourage you to mention more examples in the comment box.
II) Left-Skewed Data (Negatively skewed distribution)
A left-skewed distribution means the data has a long tail towards the left side (negative axis). An example can be students' grades: there will be fewer students with very low grades, and most students will be in the pass category.
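A small sketch using synthetic data and scipy.stats.skew; an exponential sample gives a long right tail, its mirror image gives a long left tail, and a normal sample stays close to zero skewness:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

right_skewed = rng.exponential(scale=2.0, size=10_000)   # long tail to the right
left_skewed = -rng.exponential(scale=2.0, size=10_000)   # long tail to the left
symmetric = rng.normal(size=10_000)

print(skew(right_skewed))  # positive
print(skew(left_skewed))   # negative
print(skew(symmetric))     # close to 0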
Central Limit Theorem
The Central Limit Theorem states that if we repeatedly draw sufficiently large samples from any population and compute the mean of each sample, the distribution of those sample means will be approximately normal, and its mean will be approximately equal to the population mean. That, in short, is the Central Limit Theorem.
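A minimal simulation of the idea with an invented, clearly non-normal population: the means of many repeated samples cluster around the population mean and look roughly bell-shaped:

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, non-normal population

# Means of many repeated samples of size 50
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print(np.mean(sample_means))   # close to the population mean
print(population.mean())
# A histogram of sample_means would look roughly bell-shaped even though
# the population itself is heavily skewed.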
Probability Density Function(PDF)
You may know that in a histogram we cut the data into bins and visualize the spread. But if we want to do a multiclass sort of analysis on numerical data, that is difficult with a histogram alone, and it can be done easily using a PDF. The probability density function is the smooth line drawn over a histogram using KDE (kernel density estimation); if you look at a KDE curve, it roughly passes through the top of each bin. Hence, using PDFs we can draw the KDE curves of several classes side by side and analyze multiclass data.
In the above plot, suppose I want to write three conditions that classify the three classes; how can I do that? Let us do this using the histogram and the PDF and compare the two.
From the histogram, we can write that if the value is less than 2, it is Setosa; if it is greater than 2 and less than 4.5, it is Versicolor; and from about 5 to 7 it is Virginica. Up to a certain point we are right, but beyond 4.5 there are still some Versicolor samples, so the overlapping region will trip you up, and this is where the PDF helps.
The PDF helps you write smarter conditions. The condition for Setosa stays the same (less than 2). For the other two classes, we can use the point where the two curves intersect as the threshold; the chances of correct classification increase compared to using the histogram alone.
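As a sketch of how such a plot can be produced (here using seaborn's built-in iris dataset, whose column names petal_length and species may differ from the dataframe used elsewhere in this article):

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")  # columns include petal_length and species

# Overlaid KDE curves (smoothed PDFs) for the three classes
sns.kdeplot(data=iris, x="petal_length", hue="species")
plt.xlabel("petal length")
plt.show()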
Cumulative Distribution Function (CDF)
The CDF tells us what percentage of the data is less than a particular value. To find the CDF, we add up all the histogram bins before that point, and the result is the CDF at that point. Another method is calculus, using the area under the curve: at the point where you want the CDF, draw a vertical line and compute the area under the PDF to the left of it. Hence, integrating the PDF gives the CDF, and differentiating the CDF gives the PDF.
How to calculate PDF and CDF
We will calculate the PDF and CDF for Setosa. We split the petal length into 10 bins and extract the frequency count of each bin along with the bin edges, which give the points where each bin starts and ends. To calculate the PDF, we divide each frequency count by the total count; if we then take the cumulative sum of the PDF, we get the CDF.
import numpy as np

# iris_setosa is a DataFrame of the Setosa rows, with 'PL' holding the petal length
counts, bin_edges = np.histogram(iris_setosa['PL'], bins=10)
pdf = counts / sum(counts)  # normalize counts so they sum to 1
cdf = np.cumsum(pdf)        # running total of the PDF
print(pdf)
print(cdf)
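If you want to see the two curves, a short continuation of the snippet above (it reuses pdf, cdf, and bin_edges and assumes matplotlib is available) plots them against the bin edges:

import matplotlib.pyplot as plt

# bin_edges has one more entry than counts, so plot against the right edges
plt.plot(bin_edges[1:], pdf, label="PDF")
plt.plot(bin_edges[1:], cdf, label="CDF")
plt.xlabel("petal length (PL)")
plt.legend()
plt.show()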