Thursday, October 21, 2021

Statistics Interview Question:

Q2>LOG NORMAL DISTRIBUTION VS STANDARD NORMAL DISTRIBUTION:


Ans: The major difference is in shape: the normal distribution is symmetrical, whereas the lognormal distribution is not. Because the values in a lognormal distribution are positive, they create a right-skewed curve. ... A further distinction is that the logarithms of lognormally distributed values are normally distributed.
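The relationship between the two distributions can be checked with a small simulation. This is a minimal sketch that assumes a normal distribution with mean 0 and standard deviation 0.5 (values chosen only for illustration) and uses a hypothetical `skewness` helper: exponentiating normal values gives positive, right-skewed values.

```python
import math
import random

def skewness(values):
    """Skewness as the average cubed z-score (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

rng = random.Random(0)  # fixed seed so the run is repeatable

# Draw normal values, then exponentiate them to get lognormal values.
normal_values = [rng.gauss(0.0, 0.5) for _ in range(50_000)]
lognormal_values = [math.exp(v) for v in normal_values]

print("all lognormal values positive:", min(lognormal_values) > 0)
print("skewness of normal values:    ", round(skewness(normal_values), 2))
print("skewness of lognormal values: ", round(skewness(lognormal_values), 2))
```

The normal sample should show skewness close to zero, while the lognormal sample should show clearly positive skewness.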

Q3>What is statistics?


Ans: Statistics is the science of acquiring, classifying, organizing, analyzing, interpreting, and presenting numerical data so as to make inferences about a population from a sample drawn from it.


Q4>What is Descriptive Statistics?

Ans: Descriptive statistics is the part of statistics that quantitatively describes the characteristics of the dataset under study by means of brief summaries, such as measures of central tendency and spread.

Q5>What is Inferential Statistics?

Ans: Inferential statistics is the branch of statistics in which a random sample is drawn from a large population in order to draw conclusions about the whole population from which the sample was taken.

Difference between Descriptive and Inferential Statistics:

Descriptive Statistics:

1-It gives information about raw data and describes the data in some manner.
2-It helps in organizing, analyzing, and presenting data in a meaningful manner.
3-It is used to describe a situation.
4-It explains already known data and is limited to a sample or population of small size.
5-It can be achieved with the help of charts, graphs, tables, etc.

Inferential Statistics:

1-It makes inferences about the population using a sample drawn from that population.
2-It allows us to compare data, test hypotheses, and make predictions.
3-It is used to explain the chance of occurrence of an event.
4-It attempts to reach conclusions about the population.
5-It can be achieved using probability.

What is Covariance?
Ans:

> Covariance provides insight into how two variables are related to one another.
> More precisely, covariance measures how two random variables in a data set change together.
> A positive covariance means that the two variables are positively related and move in the same direction; a negative covariance means they move in opposite directions.

What are the Advantages of the Correlation Coefficient?

> The correlation coefficient has several advantages over covariance for determining the strength of a relationship:
> Covariance can take on practically any value, while correlation is limited to the range -1 to +1.
> Because of its bounded range, correlation is more useful for judging how strong the relationship between two variables is.
> Correlation does not have units. Covariance always has units (the product of the units of the two variables).
> Correlation isn't affected by changes in the center (i.e. mean) or scale of the variables.
> "Covariance" indicates only the direction of the linear relationship between variables.
> "Correlation", on the other hand, measures both the strength and the direction of the linear relationship between two variables.
> Correlation is a function of the covariance: it is the covariance divided by the product of the two standard deviations. What sets them apart is that correlation values are standardized, whereas covariance values are not.
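The effect of units and scale on the two measures can be illustrated with a short example. This is a minimal sketch using made-up height and weight values; the helper names `covariance` and `correlation` are chosen here for illustration and use the sample (n - 1) form.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: heights in metres and weights in kilograms.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52.0, 58.0, 69.0, 77.0, 88.0]

# The same heights expressed in centimetres (a change of scale).
heights_cm = [h * 100 for h in heights_m]

print("cov (m, kg): ", covariance(heights_m, weights_kg))
print("cov (cm, kg):", covariance(heights_cm, weights_kg))   # about 100 times larger
print("corr (m, kg): ", correlation(heights_m, weights_kg))
print("corr (cm, kg):", correlation(heights_cm, weights_kg))  # essentially unchanged
```

Switching from metres to centimetres rescales the covariance but leaves the correlation the same, which is why correlation is easier to compare across data sets.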

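What are Mean, Mode & Median?

Ans:

The mean is the average of a data set: the sum of the values divided by the number of values.
The mode is the most frequently occurring value in a data set.
The median is the middle value of the data set once it is sorted (for an even number of values, the average of the two middle values).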
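As a quick illustration, all three measures are available in Python's standard `statistics` module. The data set below is made up for this example and deliberately includes one large value.

```python
import statistics

# Hypothetical data set with one repeated value and one large value.
data = [2, 3, 3, 5, 7, 10, 40]

print("mean:  ", statistics.mean(data))    # sum of values / number of values
print("median:", statistics.median(data))  # middle value after sorting
print("mode:  ", statistics.mode(data))    # most frequent value
```

The single large value pulls the mean well above the median, which is why the median is often preferred for skewed data.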

What is Central Limit Theorem?

The Central Limit Theorem (CLT) states that, for a population with finite variance and any distribution shape, the distribution of the sample mean approaches a normal distribution as the sample size grows. When many samples of size n are drawn, the following properties hold:
1. Mean of the sampling distribution (μ_x̄) = population mean (μ)
2. Standard deviation of the sampling distribution (standard error) = σ/√n ≈ S/√n
3. For n > 30 (a common rule of thumb), the sampling distribution is approximately normal.

Where:
Population variance: σ² = Σ(Xᵢ − μ)² / N, where N is the population size
Sample size (number of items in the sample): n
Sample mean: x̄ = (Σ xᵢ) / n
Sample variance: S² = Σ(xᵢ − x̄)² / (n − 1)
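These properties can be checked with a small simulation. The sketch below assumes an exponential population with rate 1 (so μ = 1 and σ = 1), chosen here only because it is clearly non-normal; the sample size and number of samples are arbitrary.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Skewed population: exponential with rate 1, so mu = 1 and sigma = 1.
population_mean = 1.0
population_std = 1.0

sample_size = 50        # n, the size of each sample
num_samples = 5_000     # how many sample means we collect

sample_means = []
for _ in range(num_samples):
    sample = [rng.expovariate(1.0) for _ in range(sample_size)]
    sample_means.append(sum(sample) / sample_size)

mean_of_means = sum(sample_means) / num_samples
std_of_means = math.sqrt(
    sum((m - mean_of_means) ** 2 for m in sample_means) / (num_samples - 1)
)
standard_error = population_std / math.sqrt(sample_size)

# Under a normal curve about 68% of values lie within one standard error.
within_one_se = sum(
    abs(m - population_mean) <= standard_error for m in sample_means
) / num_samples

print("mean of sample means:", round(mean_of_means, 3))
print("std of sample means: ", round(std_of_means, 3))
print("sigma / sqrt(n):     ", round(standard_error, 3))
print("share within 1 SE:   ", round(within_one_se, 3))
```

The mean of the sample means should land close to μ, their standard deviation close to σ/√n, and roughly two thirds of them within one standard error of μ, even though the population itself is strongly skewed.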

What is a Random Variable, and What are Its Types?
Ans:

A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables, discrete and continuous.

Discrete Random Variables:


Discrete random variables are usually (but not necessarily) counts. If a random variable can take only a finite number of distinct values, then it must be discrete. Examples of discrete random variables include the number of children in a family, the Friday night attendance at a cinema, the number of patients in a doctor's surgery, the number of defective light bulbs in a box of ten.

The probabilities pi of the possible values of a discrete random variable must satisfy:

1: 0 ≤ pi ≤ 1 for each i
2: p1 + p2 + ... + pk = 1.

Continuous Random Variables


A continuous random variable is one that can take any value within an interval, so it has an uncountably infinite number of possible values. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, and the time required to run a mile.

The probability distribution of a continuous random variable is described by a curve p(x) with two properties:

1: The curve has no negative values (p(x) ≥ 0 for all x)
2: The total area under the curve is equal to 1.

A curve meeting these requirements is known as a density curve.
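Both sets of requirements can be verified numerically. The sketch below uses made-up probabilities for the number of defective bulbs in a box and takes the standard normal density as an example density curve; the helper names and the integration range [-8, 8] are assumptions made for this illustration.

```python
import math

# Discrete case: number of defective bulbs in a box (hypothetical probabilities).
pmf = {0: 0.60, 1: 0.25, 2: 0.10, 3: 0.05}

def is_valid_pmf(probabilities):
    """Check both requirements: each p_i in [0, 1] and the p_i sum to 1."""
    values = list(probabilities.values())
    return all(0 <= p <= 1 for p in values) and math.isclose(sum(values), 1.0)

def density(x):
    """Standard normal density, used here as an example density curve."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def area_under(curve, low, high, steps=10_000):
    """Approximate the area under a curve with the trapezoidal rule."""
    width = (high - low) / steps
    total = 0.5 * (curve(low) + curve(high))
    total += sum(curve(low + i * width) for i in range(1, steps))
    return total * width

print("valid pmf:", is_valid_pmf(pmf))
print("area under density on [-8, 8]:", round(area_under(density, -8, 8), 6))
```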

What is Normal Distribution?
Ans:
1-A normal distribution is the proper term for a probability bell curve.
2-A normal distribution is defined by its mean μ and standard deviation σ; in the standard normal distribution the mean is 0 and the standard deviation is 1.
3-It has zero skew and a kurtosis of 3.
4-Normal distributions are symmetrical, but not all symmetrical distributions are normal.
5-In reality, most pricing distributions are not perfectly normal.

What is Kurtosis?
1-Kurtosis is a measure of the combined weight of a distribution's tails relative to the center of the distribution.
2-When a set of approximately normal data is graphed via a histogram, it shows a bell peak, with most data within three standard deviations (plus or minus) of the mean.
3-However, when high kurtosis is present, the tails extend farther than the three standard deviations of the normal bell-curved distribution.
4-In a normal distribution the kurtosis is 3.
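To make the tail-weight idea concrete, kurtosis can be computed as the average fourth power of the z-scores. The sketch below compares a simulated normal sample with a heavier-tailed Laplace sample; the `kurtosis` helper and the choice of comparison distribution are assumptions made for this illustration.

```python
import math
import random

def kurtosis(values):
    """Kurtosis as the average fourth power of the z-scores (normal = 3)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 4 for v in values) / (n * variance ** 2)

rng = random.Random(1)  # fixed seed so the run is repeatable

normal_sample = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Heavy-tailed sample: a Laplace (double exponential) variable, built as the
# difference of two independent exponential draws. Its kurtosis is 6.
heavy_tailed_sample = [
    rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(100_000)
]

print("kurtosis, normal sample:      ", round(kurtosis(normal_sample), 2))
print("kurtosis, heavy-tailed sample:", round(kurtosis(heavy_tailed_sample), 2))
```

The normal sample should come out close to 3, while the heavy-tailed sample should come out noticeably higher.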

What Is Meant By Linear Correlation?

The correlation coefficient is a value between -1 and +1. A correlation coefficient of +1 indicates a perfect positive correlation: as variable x increases, variable y increases, and as variable x decreases, variable y decreases. A correlation coefficient of -1 indicates a perfect negative correlation: as variable x increases, variable y decreases, and as variable x decreases, variable y increases.

KEY TAKEAWAYS on Pearson's Correlation Coefficient:

Correlation coefficients are used to measure the strength of the linear relationship between two variables.
A correlation coefficient greater than zero indicates a positive relationship while a value less than zero signifies a negative relationship.
A value of zero indicates no linear relationship between the two variables being compared; a non-linear relationship may still exist.
A negative correlation, or inverse correlation, is a key concept in the creation of diversified portfolios that can better withstand portfolio volatility.
Calculating the correlation coefficient is time-consuming, so data are often plugged into a calculator, computer, or statistics program to find the coefficient.

What is Spearman's Rank Correlation Coefficient?

1- In statistics, Spearman’s rank correlation coefficient or Spearman’s ρ, named after Charles Spearman is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.

2-The Spearman correlation can evaluate a monotonic relationship between two variables (continuous or ordinal), and it is based on the ranked values of each variable rather than the raw data.
3-The Pearson correlation evaluates only a linear relationship between two continuous variables (a relationship is linear when a change in one variable is associated with a proportional change in the other variable).
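The difference between the two coefficients shows up clearly on data that is monotonic but not linear. The sketch below is a minimal illustration with hand-written helpers (`pearson`, `ranks`, `spearman` are names chosen here); the ranking helper assumes there are no tied values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Monotonic but non-linear relationship: y grows exponentially with x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]

print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Because y always increases with x, the ranks agree perfectly and Spearman's coefficient is 1, while Pearson's coefficient is lower because the relationship is not a straight line.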

What is a P-Value?

Ans:
1-A p-value is used in hypothesis testing to help you decide whether to reject the null hypothesis.
2-The p-value is the probability of observing results at least as extreme as those in the sample, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis.
3-A small p-value (≤ 0.05) means you reject the null hypothesis: the data provide strong evidence against it.
4-A large p-value (> 0.05) means the evidence against the null hypothesis is weak, so you fail to reject the null.
5-The term significance level (alpha = 0.05) refers to a pre-chosen probability, whereas the term "p-value" refers to a probability that you calculate after a given study.
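The decision rule above can be sketched with a two-sided one-sample z-test, which is only one of many tests that produce a p-value. The function name and the study numbers below are hypothetical, and the test assumes the population standard deviation is known.

```python
import math

def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """Two-sided p-value of a one-sample z-test with known sigma."""
    z = (sample_mean - null_mean) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) for a standard normal Z, via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))

alpha = 0.05  # significance level chosen before looking at the data

# Hypothetical study: H0 says the mean is 100; we observe 103 from n = 36
# measurements with a known standard deviation of 12.
p_value = two_sided_p_value(sample_mean=103, null_mean=100, sigma=12, n=36)

print("p-value:", round(p_value, 4))
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

Here the z statistic is 1.5, so the p-value is above alpha and the null hypothesis is not rejected.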

### What are the criteria to identify an outlier?

1. A data point that lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile
2. A data point that lies more than 3 standard deviations from the mean. We can compute a z-score for each point and flag it when the absolute z-score exceeds the chosen threshold (commonly 3; a stricter threshold of 2 is sometimes used)
3. More generally, an outlier is an observation that lies an abnormal distance from other values in a random sample from a population
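The first two criteria can be sketched in a few lines with the standard `statistics` module. The function names and the data below are made up for illustration, and the quartiles use the module's default "exclusive" method, so other tools may give slightly different cut-offs.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def z_score_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10, 11, 13, 12, 95]

print("IQR rule:    ", iqr_outliers(data))
print("z-score rule:", z_score_outliers(data))
```

Note that the outlier itself inflates the mean and standard deviation, so in small samples the z-score rule can miss points that the IQR rule catches.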


### What are the reasons for outliers to exist in a dataset?


1. Variability in the data
2. An experimental measurement error


### What are the impacts of having outliers in a dataset?

1. They can cause problems in statistical analysis, such as biased estimates and misleading test results
2. They may have a significant impact on the mean and the standard deviation


### Various ways of finding the outlier.

1. Using scatter plots
2. Using box plots
3. Using the z-score
4. Using the interquartile range (IQR)
5. Using hypothesis testing (for example, Grubbs' test)



