Thursday, October 28, 2021

What are the SQL constraints?


Ans:

SQL constraints are used to specify rules for the data in a table. Constraints limit the type of data that can go into a table.


1.     NOT NULL Constraint − Ensures that a column cannot have a NULL value.

2.     DEFAULT Constraint − Provides a default value for a column when none is specified.

3.     UNIQUE Constraint − Ensures that all values in a column are different.

4.     PRIMARY KEY − Uniquely identifies each row/record in a database table.

5.     FOREIGN KEY − Links a row/record in one table to the primary key of a row in another table, enforcing referential integrity between the two tables.

6.     CHECK Constraint − Ensures that all the values in a column satisfy a given condition.

7.     INDEX − Used to create and retrieve data from the database very quickly. (A unique index has the same effect as a unique constraint on the same columns, so you can think of a unique index as implementing a unique constraint.)
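To see these constraints in action, here is a small, hedged sketch using Python's built-in sqlite3 module. The table and column names are invented for illustration, and exact DDL syntax varies slightly between database engines.

```python
import sqlite3

# In-memory database just for demonstration; the DDL below shows
# NOT NULL, DEFAULT, UNIQUE, PRIMARY KEY, CHECK and FOREIGN KEY together.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce foreign keys

conn.execute("""
CREATE TABLE departments (
    dept_id   INTEGER PRIMARY KEY,        -- uniquely identifies each row
    dept_name TEXT    NOT NULL UNIQUE     -- must be present and distinct
)""")

conn.execute("""
CREATE TABLE employees (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT    NOT NULL,                          -- NOT NULL constraint
    country TEXT    DEFAULT 'India',                   -- DEFAULT constraint
    email   TEXT    UNIQUE,                            -- UNIQUE constraint
    salary  REAL    CHECK (salary > 0),                -- CHECK constraint
    dept_id INTEGER REFERENCES departments(dept_id)    -- FOREIGN KEY constraint
)""")

conn.execute("INSERT INTO departments (dept_id, dept_name) VALUES (1, 'Analytics')")
conn.execute("INSERT INTO employees (name, salary, dept_id) VALUES ('Asha', 50000, 1)")

# Violating a constraint raises an IntegrityError, e.g. a negative salary:
try:
    conn.execute("INSERT INTO employees (name, salary, dept_id) VALUES ('Ravi', -10, 1)")
except sqlite3.IntegrityError as e:
    print("CHECK constraint blocked the insert:", e)
```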

Key Differences Between Primary key and Unique key

 


1.    When an attribute is declared as a primary key, it will not accept NULL values. On the other hand, when an attribute is declared as unique, it can accept one NULL value (in SQL Server).

2.    A table can have only one primary key, whereas there can be multiple unique constraints on a table.

3.    A clustered index is automatically created when a primary key is defined. In contrast, a unique key generates a non-clustered index.

 

Key Differences Between Fact Table and Dimension Table

 


1.    A fact table contains measurements recorded along the dimensions/attributes of the dimension tables.

2.    A fact table contains more records and fewer attributes as compared to a dimension table, whereas a dimension table contains more attributes and fewer records.

3.    A fact table grows vertically (more rows), whereas a dimension table grows horizontally (more attributes).

4.    Each dimension table contains a primary key to identify each record in the table, whereas the fact table contains a concatenated key, which is a combination of the primary keys of all the dimension tables.

5.    Dimension tables have to be created before the fact table.

6.    A schema contains fewer fact tables but more dimension tables.

7.    Attributes in a fact table are numeric as well as textual, but attributes of a dimension table are textual only.
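As a rough illustration of points 1 and 4 above, here is a hedged sketch of a tiny star-style layout in SQLite via Python (all table and column names are made up): small dimension tables with their own primary keys, and a fact table whose primary key is the concatenation of the dimension keys plus numeric measures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Dimension tables: few rows, mostly descriptive (textual) attributes.
conn.execute("""
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
)""")
conn.execute("""
CREATE TABLE dim_date (
    date_id   INTEGER PRIMARY KEY,
    full_date TEXT,
    month     TEXT
)""")

# Fact table: many rows, numeric measures, and a concatenated key
# made from the primary keys of the dimension tables.
conn.execute("""
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units_sold INTEGER,
    revenue    REAL,
    PRIMARY KEY (product_id, date_id)
)""")
```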

Differences Between Data Warehouse and Data Mart

 

Key Differences Between Data Warehouse and Data Mart

1.    A data warehouse is application-independent, whereas a data mart is specific to a decision-support-system application.

2.    The data is stored in a single, centralized repository in a data warehouse. As against, a data mart stores data decentrally in the user area.

3.    A data warehouse contains a detailed form of data. In contrast, a data mart contains summarized and selected data.

4.    The data in a data warehouse is slightly denormalized, while in the case of a data mart it is highly denormalized.

5.    The construction of a data warehouse involves a top-down approach. Conversely, while constructing a data mart the bottom-up approach is used.

6.    A data warehouse is flexible, information-oriented, and long-lived. On the contrary, a data mart is restrictive, project-oriented, and has a shorter existence.

7.    A fact constellation schema is usually used for modeling a data warehouse, whereas in a data mart the star schema is more popular.


Differences Between Star and Snowflake Schema

 

Key Differences Between Star and Snowflake Schema

1.    A star schema contains just one dimension table for each dimension entry, while in a snowflake schema there may be a dimension table plus sub-dimension tables for one entry.

2.    Normalization is used in the snowflake schema, which eliminates data redundancy. As against, normalization is not performed in the star schema, which results in data redundancy.

3.    A star schema is simple, easy to understand, and involves less intricate queries. On the contrary, a snowflake schema is harder to understand and involves complex queries.

4.    The data model approach used in a star schema is top-down, whereas a snowflake schema uses bottom-up.

5.    A star schema uses a smaller number of joins. On the other hand, a snowflake schema uses a large number of joins.

6.    The space consumed by a star schema is more as compared to a snowflake schema.

7.    The time consumed for executing a query in a star schema is less. Conversely, a snowflake schema consumes more time due to the excessive use of joins.

Normalization vs Denormalization

 

Key Differences Between Normalization and Denormalization

1.    Normalization is the technique of dividing the data into multiple tables to reduce data redundancy and inconsistency and to achieve data integrity. On the other hand, denormalization is the technique of combining the data into a single table to make data retrieval faster.

2.    Normalization is used in OLTP systems, which emphasize making inserts, deletes and updates fast and anomaly-free. As against, denormalization is used in OLAP systems, which emphasize making search and analysis faster.

3.    Data integrity is maintained in the normalization process, while in denormalization data integrity is harder to retain.

4.    Redundant data is eliminated when normalization is performed, whereas denormalization increases the redundant data.

5.    Normalization increases the number of tables and joins. In contrast, denormalization reduces the number of tables and joins.

6.    Disk space is wasted in denormalization because the same data is stored in different places. On the contrary, disk space is optimized in a normalized table.
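A minimal sketch of the same idea in pandas (tables and values are invented): normalized customer and order tables on one side, and a single denormalized table produced by joining them on the other.

```python
import pandas as pd

# Normalized: customers and orders live in separate tables, no repetition.
customers = pd.DataFrame({"customer_id": [1, 2],
                          "customer_name": ["Asha", "Ravi"],
                          "city": ["Pune", "Delhi"]})
orders = pd.DataFrame({"order_id": [101, 102, 103],
                       "customer_id": [1, 1, 2],
                       "amount": [250, 400, 150]})

# Denormalized: one wide table; customer details are repeated on every order,
# which makes reads/analysis faster but wastes space and risks inconsistency.
denormalized = orders.merge(customers, on="customer_id", how="left")
print(denormalized)
```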

Tuesday, October 26, 2021

Univariate, Bivariate and Multivariate Analysis by EDA

 # Data science life cycle:

Every data science beginner, working professional, student or practitioner follows a few steps while working on a data science problem. I will tell you about all these steps in simple terms for your understanding.


# 1. Hypothesis definition:- A proposed explanation made as a starting point for further investigation.

Ex:- A (company) wants to release a raincoat (product) in summer. Now the company is in a dilemma whether to release the product or not. (I know it's a bad idea, but for understanding, let's go with this.)


# 2. Data Acquisition:- Collecting the required data.

Ex:- Collecting the last 10 years of data for a certain region.


# 3. Exploratory Data Analysis (EDA):-

Analysing the collected data using some concepts (we will see them below).

Ex: On the collected (existing) data, data scientists perform some analysis and decide what features/metrics to consider for model building.


# 4. Model building:-

This is where machine learning comes into the picture.

Ex:- Using the metrics (outputs of EDA), they will predict (using ML) whether the product will be successful or not if it goes into the market.


# 5. Result report:-

After EDA and model building, the process generates results.

Ex: As a result of all the above steps we get some results, which decide whether to start production or not.

# 6. Final product:-

Based on the result, we will get a product.

Ex:- If the result generated is positive, A (the company) can start production. If the result is negative, A won't start production.

# Exploratory Data Analysis:-

By definition, exploratory data analysis is an approach to analysing data to summarise its main characteristics, often with visual methods.

In other words, we perform analysis on the data that we collected to find important metrics/features by using some nice and pretty visualisations.

Every person makes decisions in life by considering a few points in a given situation. To be accurate with these decisions, a data scientist does some EDA on the data.


# Exploratory Data Analysis is majorly performed using the following methods:


# Univariate analysis:-

Univariate analysis provides summary statistics for each field in the raw data set (or) a summary of only one variable. Ex:- CDF, PDF, box plot, violin plot. (Don't worry, we will see below what each of them is.)


# Bivariate analysis:-

Bivariate analysis is performed to find the relationship between each variable in the dataset and the target variable of interest (or) using 2 variables and finding the relationship between them. Ex:- Box plot, violin plot.


# Multivariate analysis:-

Multivariate analysis is performed to understand interactions between different fields in the dataset (or) finding interactions between more than 2 variables. Ex:- Pair plot and 3D scatter plot.

# Let's download a data set from Kaggle (home for data scientists); you can download it and learn more about it here → Haberman dataset.
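Here is a minimal EDA sketch for that dataset using pandas, seaborn and matplotlib. It assumes you have downloaded the file as haberman.csv and that the raw file has no header row, so the column names (age, year, nodes, status) are assigned manually; adjust them if your copy differs.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file name and column names for the Haberman survival dataset.
df = pd.read_csv("haberman.csv", names=["age", "year", "nodes", "status"])

# Univariate analysis: distribution of a single variable.
sns.histplot(df["age"], kde=True)
plt.title("Univariate: distribution of age")
plt.show()

sns.boxplot(x=df["nodes"])
plt.title("Univariate: box plot of positive axillary nodes")
plt.show()

# Bivariate analysis: relationship between two variables.
sns.violinplot(x="status", y="nodes", data=df)
plt.title("Bivariate: nodes vs survival status")
plt.show()

# Multivariate analysis: interactions among several variables at once.
sns.pairplot(df, hue="status")
plt.show()
```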

Full Python code is available at


Multiple_linear_regression

 # What is multiple linear regression?

* Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.


* Multiple regression is an extension of linear (OLS) regression, which uses just one explanatory variable.


* MLR is used extensively in econometrics and financial inference.


* Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how the dependent variable changes as the independent variable(s) change.


* Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:


* How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).


* The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).


*---------------------------------------------------------------------------------------------------------


* Simple linear regression is a one-to-one relationship, but multiple regression is a many-to-one relationship (many independent variables, one dependent variable).

* Adding more independent variables to the model doesn't necessarily mean the regression will be better; in fact it can become worse. This can lead to overfitting.

* Adding more independent variables can also lead to multicollinearity.

* The idea is that all independent variables should be correlated with the dependent variable, not with each other.
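Below is a small, hedged sketch of multiple linear regression with scikit-learn, using synthetic data for the crop-yield example mentioned above (rainfall, temperature and fertilizer as independent variables); the numbers are made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: rainfall, temperature and fertilizer are the independent
# variables, yield is the dependent variable (values invented for illustration).
rng = np.random.default_rng(42)
n = 200
rainfall = rng.uniform(50, 200, n)
temperature = rng.uniform(15, 35, n)
fertilizer = rng.uniform(0, 10, n)
yield_ = 2.0 * rainfall + 1.5 * temperature + 5.0 * fertilizer + rng.normal(0, 10, n)

X = np.column_stack([rainfall, temperature, fertilizer])
model = LinearRegression().fit(X, yield_)

print("Coefficients:", model.coef_)    # one coefficient per independent variable
print("Intercept:", model.intercept_)
print("R^2:", model.score(X, yield_))  # goodness of fit on the training data
```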


Python code is available on GITHUB 

GITHUBLINK

Multiple examples of multiple linear regression are available at the link below:

GITHUB


Multi-Collinearity in Machine Learning

 # What is multicollinearity?

>Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.


>Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model. This means that an independent variable can be predicted from another independent variable in a regression model


# Why Multi-Collinearity is a problem?


When independent variables are highly correlated, change in one variable would cause change to another and so the model results fluctuate significantly. The model results will be unstable and vary a lot given a small change in the data or model. This will create the following problems:


1>It would be hard for you to choose the list of significant variables for the model if the model gives you different results every time.


2>Coefficient Estimates would not be stable and it would be hard for you to interpret the model. In other words, you cannot tell the scale of changes to the output if one of your predicting factors changes by 1 unit.


3>The unstable nature of the model may cause overfitting. If you apply the model to another sample of data, the accuracy will drop significantly compared to the accuracy of your training dataset


# How to identify that multicollinearity exists?

1> Correlation greater than 0.8 between 2 variables.


2> Variance inflation factor (VIF) > 20.


3> R-squared and adjusted R-squared values lie between 0 and 1 (the closer to 1, the better).


4> Check that coefficient values are not unusually high.


5> If a coefficient value is negative, the target decreases as that variable increases; e.g. a 1-unit change in newspaper price decreases the response by 0.0010 (as the coefficient value is negative).


6> A high standard error can indicate that multicollinearity exists.


7> Variables with a high p-value are not significant and can be ignored.
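As a practical illustration of checks 1 and 2, here is a hedged sketch that computes pairwise correlations and VIF with statsmodels on synthetic data (variable names and values are invented); thresholds such as 0.8 for correlation or a fixed VIF cut-off are rules of thumb and vary by practitioner.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Synthetic predictors where x3 is almost a copy of x1 (made-up data),
# so we expect a very large VIF for x1 and x3.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + rng.normal(scale=0.01, size=300)   # nearly collinear with x1
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X.corr())                              # check for correlations above ~0.8

X_const = add_constant(X)                    # VIF is computed against the other columns
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)
```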


Python code for multicollinearity is available on the below link:

GITHUBLINK

Linear Regression In Depth

 # What is a Regression

* In regression, we fit a line or curve between the variables that best fits the given data points, so that the machine learning model can deliver predictions regarding the data.

* In naïve words, “Regression shows a line or curve that passes through all the data points on a target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum.”


# Types of Regression models

* Linear Regression

* Polynomial Regression

* Logistic Regression


# Linear Regression in Machine Learning:


* Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.

* The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.

* The linear regression model provides a sloped straight line representing the relationship between the variables

* Linear regression is a quite simple statistical regression method used for predictive analysis and shows the relationship between continuous variables. Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), and is consequently called linear regression.

* If there is a single input variable (x), such linear regression is called simple linear regression. And if there is more than one input variable, such linear regression is called multiple linear regression. 

* The linear regression model gives a sloped straight line describing the relationship within the variables.

y = mx + c = a0 + a1·x


# Finding the best fit line:

* When working with linear regression, our main goal is to find the best fit line, which means the error between the predicted values and the actual values should be minimized. The best fit line will have the least error.

* Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate this we use a cost function.

# Cost function-

* Different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.
* The cost function optimizes the regression coefficients or weights.
* It measures how well a linear regression model is performing.
* We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the hypothesis function.
* For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values.

* Using the MSE function, we will change the values of a0 and a1 such that the MSE value settles at the minimum. The model parameters m and b (a0, a1) can be manipulated to minimize the cost function. These parameters can be determined using the gradient descent method so that the cost function value is minimum.

* It can be written as:


* Given our simple linear equation y=mx+b, we can calculate MSE as:

* MSE = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - (m x_i + b)\big)^2
* With several weights: MSE = \frac{1}{2N}\sum_{i=1}^{N}\big(y_i - (W_1 x_{1i} + W_2 x_{2i} + W_3 x_{3i})\big)^2


* N is the total number of observations (data points)
* \frac{1}{N}\sum_{i=1}^{N} takes the mean over all observations
* y_i is the actual value of an observation and m x_i + b is our prediction
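A tiny numpy sketch of the simple-line MSE formula above (the data points are made up):

```python
import numpy as np

def mse(y, x, m, b):
    """Mean squared error for the line y_hat = m*x + b."""
    y_hat = m * x + b
    return np.mean((y - y_hat) ** 2)

# Made-up example: three points that lie close to y = 2x + 1.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.1, 4.9, 7.2])
print(mse(y, x, m=2.0, b=1.0))   # small value, because the line fits well
```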

# Residuals:

 * The distance between the actual value and the predicted value is called the residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high.
 * If the scatter points are close to the regression line, the residuals will be small, and hence the cost function will be small too.

# Gradient descent:

* Gradient descent is a method of updating a0 and a1 to minimize the cost function (MSE). A regression model uses gradient descent to update the coefficients of the line (a0, a1, i.e. m and b) by starting from a random selection of coefficient values and then iteratively updating the values to reach the minimum of the cost function.

* In the gradient descent algorithm, the size of the step you take is the learning rate, and this decides how fast the algorithm converges to the minimum.
* To update a0 and a1, we take gradients from the cost function. To find these gradients, we take partial derivatives with respect to a0 and a1.
* We can calculate the gradient of this cost function as:

* f'(m, b) = \begin{bmatrix} \frac{df}{dm} \\ \frac{df}{db} \end{bmatrix} = \begin{bmatrix} \frac{1}{N}\sum -x_i \cdot 2\,(y_i - (m x_i + b)) \\ \frac{1}{N}\sum -1 \cdot 2\,(y_i - (m x_i + b)) \end{bmatrix} = \begin{bmatrix} \frac{1}{N}\sum -2 x_i\,(y_i - (m x_i + b)) \\ \frac{1}{N}\sum -2\,(y_i - (m x_i + b)) \end{bmatrix}
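Putting the gradient formula to work, here is a hedged numpy sketch of gradient descent for the simple line y = m·x + b (the learning rate, number of epochs and toy data are arbitrary choices for illustration):

```python
import numpy as np

def gradient_descent(x, y, lr=0.05, epochs=2000):
    """Fit y ≈ m*x + b by repeatedly stepping down the MSE gradient."""
    m, b = 0.0, 0.0                                 # start from arbitrary coefficients
    n = len(x)
    for _ in range(epochs):
        y_hat = m * x + b
        dm = (-2.0 / n) * np.sum(x * (y - y_hat))   # df/dm from the formula above
        db = (-2.0 / n) * np.sum(y - y_hat)         # df/db from the formula above
        m -= lr * dm                                # learning rate controls step size
        b -= lr * db
    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                                   # data generated exactly from y = 2x + 1
print(gradient_descent(x, y))                       # should approach (2.0, 1.0)
```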

# Summary:

* In regression, we fit a line or curve between the variables that best fits the given data points. Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis). To calculate the best-fit line, linear regression uses the traditional slope-intercept form. A regression line can represent a positive linear relationship or a negative linear relationship.


* The goal of the linear regression algorithm is to get the best values for a0 and a1 to find the best fit line, and the best fit line should have the least error. In linear regression, the Mean Squared Error (MSE) cost function is used, which helps to figure out the best possible values for a0 and a1 that provide the best fit line for the data points. Using the MSE function, we will change the values of a0 and a1 such that the MSE value settles at the minimum. Gradient descent is a method of updating a0 and a1 to minimize the cost function (MSE).


# Assumptions of Linear Regression:

Below are some important assumptions of linear regression. These are formal checks to perform while building a linear regression model, which ensure that we get the best possible result from the given dataset.

* Linear relationship between the features and target:

Linear regression assumes the linear relationship between the dependent and independent variables.

* Small or no multicollinearity between the features:

Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable. Or, we can say, it is difficult to determine which predictor variable is affecting the target variable and which is not. So, the model assumes either little or no multicollinearity between the features or independent variables.

* Homoscedasticity Assumption:

Homoscedasticity is a situation when the error term is the same for all the values of independent variables. With homoscedasticity, there should be no clear pattern distribution of data in the scatter plot.

* Normal distribution of error terms:

Linear regression assumes that the error terms should follow a normal distribution. If the error terms are not normally distributed, the confidence intervals will become either too wide or too narrow, which may cause difficulties in finding the coefficients.
This can be checked using a Q-Q plot. If the plot shows a straight line without any deviation, it means the errors are normally distributed.

* No autocorrelations:

The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually occurs if there is a dependency between the residual errors.
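Two of these checks can be sketched quickly in Python: a Q-Q plot for the normality of error terms and the Durbin-Watson statistic for autocorrelation. The residuals below are synthetic, just to show the mechanics.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Made-up residuals standing in for the residuals of a fitted model.
rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, 200)

# Normality of error terms: points should lie close to the straight line.
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")
plt.show()

# Autocorrelation: a Durbin-Watson value near 2 suggests no autocorrelation;
# values toward 0 or 4 suggest positive or negative autocorrelation.
print("Durbin-Watson statistic:", durbin_watson(residuals))
```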

Python code is available on GitHub; please find the code here:

Thursday, October 21, 2021

Constructors VS Destructors in Python

Difference between Constructors VS Destructors  in Python


What are constructors in Python?

ANS:
Constructors are generally used for instantiating an object. The task of a constructor is to initialize (assign values to) the data members of the class when an object of the class is created. In Python, the __init__() method is called the constructor and is always called when an object is created.

Syntax of constructor declaration:

def __init__(self):
    # body of the constructor

Types of constructors:

1> Default constructor: a simple constructor which doesn't accept any arguments. Its definition has only one argument, which is a reference to the instance being constructed.

2> Parameterized constructor: a constructor with parameters is known as a parameterized constructor. The parameterized constructor takes its first argument as a reference to the instance being constructed, known as self, and the rest of the arguments are provided by the programmer.

#python #reference #oops #dataanalytics
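A small runnable sketch of both kinds of constructors (the class and attribute names are made up):

```python
class Employee:
    # Parameterized constructor: self refers to the instance being created,
    # the remaining arguments are supplied by the caller.
    def __init__(self, name, salary=0):
        self.name = name        # initialize the data members of the class
        self.salary = salary

class Counter:
    # Default-style constructor: no arguments other than self.
    def __init__(self):
        self.count = 0

e = Employee("Asha", 50000)     # __init__ runs automatically here
c = Counter()
print(e.name, e.salary, c.count)
```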

What are destructors in Python?

ANS:
Destructors are called when an object gets destroyed. In Python, destructors are not needed as much as in C++ because Python has a garbage collector that handles memory management automatically.

The __del__() method is known as the destructor method in Python. It is called when all references to the object have been deleted, i.e. when the object is garbage collected.

Syntax of destructor declaration:

def __del__(self):
    # body of the destructor

A simple example of a destructor is given below: by using the del keyword we delete all references to the object 'obj', therefore the destructor is invoked automatically.

Generally, Python's garbage collector, which is used to detect cyclic references, would remove such objects, but when instances with a custom destructor are involved in a reference cycle they can be marked as "uncollectable" (in Python versions before 3.4). Simply put, the collector doesn't know the order in which to destroy the objects, so it leaves them. Therefore, if your instances are involved in circular references they may live in memory for as long as the application runs.
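Here is the simple destructor example referred to above (the class name is made up for illustration):

```python
class FileHandle:
    def __init__(self, name):
        self.name = name
        print(f"{self.name}: constructor (__init__) called")

    def __del__(self):
        # Destructor: runs when the last reference to the object is gone.
        print(f"{self.name}: destructor (__del__) called")

obj = FileHandle("obj")
del obj   # removing the only reference triggers __del__ (right away in CPython)
```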


"🚀 Delta Lake's Vectorized Delete: The Secret to 10x Faster Data Operations!"

"🚀 Delta Lake's Vectorized Delete: The Secret to 10x Faster Data Operations!" Big news for data engineers! Delta Lake 2.0+ in...