Friday, August 5, 2022

Standard Deviation & Variance

Standard Deviation

The Standard Deviation is a measure of how spread out numbers are.

Its symbol is σ (the Greek letter sigma).

The formula is easy: it is the square root of the Variance. So now you ask, "What is the Variance?"

 

 


Formulas

Here are the two formulas, explained at Standard Deviation Formulas if you want to know more:


The "Population Standard Deviation":

 

 

 

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

 

 

 

 

The “Sample Standard Deviation”:

 

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$

 

 

 

 

It looks complicated, but the important change is to divide by N-1 (instead of N) when calculating a Sample Variance.

When you have "N" data values that are:

  • The Population: divide by N when calculating Variance (as the worked example in this post does)
  • A Sample: divide by N-1 when calculating Variance
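To see the difference between dividing by N and by N-1, here is a minimal sketch using Python's standard `statistics` module. The data values are hypothetical, picked only for illustration; `pstdev` treats them as a whole population and `stdev` treats them as a sample.

```python
import statistics

# Hypothetical data values, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Treat the values as the whole population: divide by N.
population_sd = statistics.pstdev(data)

# Treat the values as a sample: divide by N - 1.
sample_sd = statistics.stdev(data)

print(f"Population standard deviation: {population_sd:.3f}")
print(f"Sample standard deviation:     {sample_sd:.3f}")
```

Because the sample version divides by a smaller number, it always comes out a little larger than the population version for the same data.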

Variance

The Variance is defined as:

The average of the squared differences from the Mean.

To calculate the variance follow these steps:

  • Work out the Mean (the simple average of the numbers)
  • Then for each number: subtract the Mean and square the result (the squared difference).
  • Then work out the average of those squared differences.
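The three steps above can be sketched as a small Python function. This is only an illustration: the function name `variance` and the input list are assumptions, and it computes the population variance (dividing by N).

```python
def variance(values):
    """Population variance, following the three steps listed above."""
    n = len(values)

    # Step 1: work out the mean.
    mean = sum(values) / n

    # Step 2: for each number, subtract the mean and square the result.
    squared_diffs = [(x - mean) ** 2 for x in values]

    # Step 3: average the squared differences.
    return sum(squared_diffs) / n


# Hypothetical input: mean is 3, squared differences are 4, 1, 0, 1, 4.
print(variance([1, 2, 3, 4, 5]))  # prints 2.0
```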

 

Example of Standard Deviation vs. Variance

 

To demonstrate how both measures work, let's work through an example of standard deviation and variance.

Suppose you have a series of numbers and you want to figure out the standard deviation for the group. The numbers are 4, 34, 18, 12, 2, and 26. First we need to determine the mean, or average, of the numbers. We do this by adding the numbers up and dividing the sum by the total count in the group:

(4 + 34 + 18 + 12 + 2 + 26) ÷ 6 = 16


So the mean is 16. Now subtract the mean from each number then square the result:

  • (4 - 16)² = 144
  • (34 - 16)² = 324
  • (18 - 16)² = 4
  • (12 - 16)² = 16
  • (2 - 16)² = 196
  • (26 - 16)² = 100

Now we have to figure out the average or mean of these squared values to get the variance. This is done by adding up the squared results from above, then dividing it by the total count in the group:

(144 + 324 + 4 + 16 + 196 + 100) ÷ 6 = 130.67

This means we end up with a variance of 130.67. To figure out the standard deviation, we take the square root of the variance, which is approximately 11.43.
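Written out compactly, the whole calculation looks like this. The last line is an added contrast, not part of the example itself: it assumes the same six numbers were a sample rather than the whole population, so the sum of squared differences is divided by N-1 = 5 instead.

```latex
\begin{align*}
\mu      &= \frac{4 + 34 + 18 + 12 + 2 + 26}{6} = \frac{96}{6} = 16 \\
\sigma^2 &= \frac{144 + 324 + 4 + 16 + 196 + 100}{6} = \frac{784}{6} \approx 130.67 \\
\sigma   &= \sqrt{130.67} \approx 11.43 \\
s        &= \sqrt{\frac{784}{5}} = \sqrt{156.8} \approx 12.52
\end{align*}
```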

 

The Lognormal Distribution vs. the Normal Distribution

 

 

 



 

In a normal distribution, about 68% (34% + 34%) of the results fall within one standard deviation of the mean, and about 95% (68% + 13.5% + 13.5%) fall within two standard deviations. At the center (the 0 point in the image above), the median (the middle value in the set), the mode (the value that occurs most often), and the mean (the arithmetic average) are all the same.
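The 68% and 95% figures can be checked with the `NormalDist` class from Python's standard `statistics` module. This is a small sketch; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Probability that a value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


print(f"Within 1 standard deviation:  {share_within(1):.4f}")  # 0.6827
print(f"Within 2 standard deviations: {share_within(2):.4f}")  # 0.9545
```

So the commonly quoted 68% and 95% are rounded values.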

 

 

Summary

  • The lognormal distribution differs from the normal distribution in several ways. A major difference is in its shape: the normal distribution is symmetrical, whereas the lognormal distribution is not. Because the values in a lognormal distribution are positive, they create a right-skewed curve.

  • The lognormal distribution model is considered to be very useful in the fields of medicine, economics, and engineering.
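  • Overall, a variable follows a lognormal distribution when its logarithm follows a normal distribution. Put the other way round, exponentiating normally distributed values produces lognormally distributed values.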

Right-skewed distributions with low mean values, large variance, and all-positive values often fit this distribution. Examples of lognormal distributions in nature include the amount of rainfall, milk production by cows, and most natural growth processes in which the growth rate is independent of size.
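One rough way to see the right skew is to draw values from a normal distribution, exponentiate them, and compare the mean with the median. The sketch below assumes an underlying normal distribution with mean 0 and standard deviation 1 and a fixed random seed; both are arbitrary choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed parameters: underlying normal with mean 0 and standard deviation 1.
normal_values = [rng.gauss(0, 1) for _ in range(10_000)]

# Exponentiating normal values gives lognormal values.
lognormal_values = [math.exp(v) for v in normal_values]

print("All lognormal values positive:", min(lognormal_values) > 0)
print(f"Normal:    mean {statistics.mean(normal_values):.3f}, "
      f"median {statistics.median(normal_values):.3f}")
print(f"Lognormal: mean {statistics.mean(lognormal_values):.3f}, "
      f"median {statistics.median(lognormal_values):.3f}")
```

In the symmetric normal sample the mean and median sit close together, while in the lognormal sample the long right tail pulls the mean above the median.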

 

  • This skewness is important in determining which distribution is appropriate to use in investment decision-making. A further distinction is that the values used to derive a lognormal distribution are normally distributed.

 

  • Let's clarify with an example. An investor wants to know an expected future stock price. Since stocks grow at a compounded rate, the investor needs to use a growth factor. To calculate possible expected prices, they take the current stock price and multiply it by various growth factors (exponential factors derived from compounded rates of return), where the rates of return are assumed to be normally distributed. When the investor continuously compounds the returns, the resulting prices follow a lognormal distribution. These prices are always positive even when some of the rates of return are negative, which happens 50% of the time if the returns are normally distributed around a mean of zero. This matches reality, because stock prices cannot fall below $0.
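The stock price example can be sketched as a small simulation. Everything numeric here is an assumption for illustration only: a current price of 100, continuously compounded returns that are normally distributed with mean 0 and standard deviation 0.2, and a fixed random seed.

```python
import math
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

current_price = 100.0  # hypothetical current stock price
mean_return = 0.0      # assumed mean of the continuously compounded return
volatility = 0.2       # assumed standard deviation of that return

# Normally distributed rates of return; roughly half will be negative.
returns = [rng.gauss(mean_return, volatility) for _ in range(10_000)]

# Continuous compounding: multiply the price by the growth factor e^r.
future_prices = [current_price * math.exp(r) for r in returns]

negative_share = sum(r < 0 for r in returns) / len(returns)
print(f"Share of negative returns: {negative_share:.2%}")
print(f"Lowest simulated price:    {min(future_prices):.2f}")
```

Even though about half of the simulated returns are negative, the growth factor e^r is always greater than zero, so every simulated price stays positive.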

Difference between Descriptive and Inferential Statistics


 

Descriptive Statistics

Use descriptive statistics to summarize and graph the data for a group that you choose. This process allows you to understand that specific set of observations.

Descriptive statistics frequently use the following statistical measures to describe groups:

 

  I. Central tendency: Use the mean or the median to locate the center of the dataset. This measure tells you where most values fall.

  II. Dispersion: How far out from the center do the data extend? You can use the range or standard deviation to measure the dispersion. A low dispersion indicates that the values cluster more tightly around the center. Higher dispersion signifies that data points fall further away from the center. We can also graph the frequency distribution.

  III. Skewness: This measure tells you whether the distribution of values is symmetric or skewed.

There are other descriptive analyses you can perform, such as assessing the relationships of paired data using correlation and scatterplots.
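As a rough illustration, the three kinds of measures listed above can be computed with Python's standard `statistics` module. The data are the six numbers from the worked example earlier in this post. The skewness line uses the average cubed z-score, which is one common definition among several, so treat it as an assumption rather than the only formula.

```python
import statistics

# The six numbers from the worked example earlier in this post.
values = [4, 34, 18, 12, 2, 26]

# Central tendency
mean = statistics.mean(values)
median = statistics.median(values)

# Dispersion
value_range = max(values) - min(values)
std_dev = statistics.pstdev(values)  # population standard deviation

# Skewness as the average cubed z-score (one common definition; others exist)
skewness = sum(((x - mean) / std_dev) ** 3 for x in values) / len(values)

print(f"Mean: {mean}, median: {median}")
print(f"Range: {value_range}, standard deviation: {std_dev:.2f}")
print(f"Skewness: {skewness:.3f}")
```

A skewness above zero points to a longer right tail, and a value near zero points to a roughly symmetric set of values.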

Inferential Statistics

For inferential statistics, we need to define the population and then devise a sampling plan that produces a representative sample. The statistical results incorporate the uncertainty that is inherent in using a sample to understand an entire population. The sample size becomes a vital characteristic. The law of large numbers states that as the sample size grows, the sample statistics (for example, the sample mean) will converge on the population value.
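The law of large numbers can be illustrated with a quick simulation. The population here is hypothetical: normally distributed with a mean of 50 and a standard deviation of 10, and the random seed is fixed so the run is repeatable.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

POPULATION_MEAN = 50  # hypothetical population mean
POPULATION_SD = 10    # hypothetical population standard deviation


def sample_mean(n):
    """Draw a random sample of size n and return its mean."""
    sample = [rng.gauss(POPULATION_MEAN, POPULATION_SD) for _ in range(n)]
    return statistics.fmean(sample)


for n in (10, 100, 10_000, 100_000):
    m = sample_mean(n)
    print(f"n = {n:>7}: sample mean = {m:.3f}, error = {abs(m - POPULATION_MEAN):.3f}")
```

Individual runs bounce around, but the error generally shrinks as the sample size grows.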