Posts

Showing posts from 2021

Azure Data Factory introduction

What is Azure Data Factory?
* Azure Data Factory is a managed cloud service built for complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
* Azure Data Factory is the platform that solves such complex data scenarios: it is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale.
* Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database (a minimal pipeline sketch follows this excerpt).
* Additionally, you can publish your transformed data to data stores such as Azure Synapse Analytics for business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can ...
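The post describes pipelines conceptually; below is a minimal, hedged sketch of what defining and triggering one can look like with the azure-mgmt-datafactory Python SDK. The resource group, factory, pipeline, and dataset names are placeholders, and the copy activity assumes two blob-storage datasets already exist in the factory.

```python
# Minimal sketch (not from the post): publish and run a copy pipeline with the ADF Python SDK.
# Assumes an existing factory and two blob datasets; all names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
)

credential = DefaultAzureCredential()
adf = DataFactoryManagementClient(credential, "<subscription-id>")

# One copy activity: ingest data from a raw dataset into a staging dataset.
copy = CopyActivity(
    name="CopyRawToStaging",
    inputs=[DatasetReference(type="DatasetReference", reference_name="RawBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="StagingBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Publish the pipeline to the factory, then start an on-demand run.
adf.pipelines.create_or_update("<resource-group>", "<factory-name>", "IngestPipeline",
                               PipelineResource(activities=[copy]))
run = adf.pipelines.create_run("<resource-group>", "<factory-name>", "IngestPipeline")
print("Pipeline run id:", run.run_id)
```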

Bagging in Machine Learning

What Is Ensemble Learning?
* Machine learning uses several techniques to build models and improve their performance.
* Ensemble learning methods help improve the accuracy of classification and regression models.
* Ensemble learning is a widely used machine learning technique in which multiple individual models, often called base models, are combined to produce an effective optimal prediction model.
* The Random Forest algorithm is an example of ensemble learning.

What Is Bagging in Machine Learning?
* Bagging, also known as bootstrap aggregating, is an ensemble learning technique that helps improve the performance and accuracy of machine learning algorithms.
* It is used to deal with the bias-variance trade-off and reduces the variance of a prediction model.
* Bagging helps avoid overfitting and is used for both regression and classification models, specifically for decision tree algorithms (see the sketch after this excerpt).

What Is Bootstrapping?
* Bootstrapping is the m...
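A minimal scikit-learn sketch of bagging, using the default decision-tree base estimator; the dataset and parameter values are chosen only for illustration.

```python
# Illustrative sketch: bagging decision trees with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 base estimators (decision trees by default) is trained on a
# bootstrap sample of the training set; their predictions are combined by vote,
# which reduces the variance of the final model.
bagging = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
print("Test accuracy:", bagging.score(X_test, y_test))
```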

Random Forest intuition in Machine Learning

What is a random forest?
* A random forest is a machine learning technique used to solve regression and classification problems.
* It utilizes ensemble learning, a technique that combines many classifiers to provide solutions to complex problems.
* A random forest algorithm consists of many decision trees. The ‘forest’ generated by the random forest algorithm is trained through bagging, or bootstrap aggregating.
* Bagging is an ensemble meta-algorithm that improves the accuracy of machine learning algorithms.
* The random forest algorithm establishes the outcome based on the predictions of the decision trees.
* It predicts by taking the average or mean of the output from the various trees.
* Increasing the number of trees generally increases the precision of the outcome.
* A random forest overcomes the limitations of a single decision tree algorithm.
* It reduces the overfitting of datasets and increases precision.
* It generates predictions without requiring many co...
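A short scikit-learn sketch of the idea; the dataset and the number of trees are illustrative, not taken from the post.

```python
# Illustrative sketch: a random forest classifier in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 200 decision trees, each grown on a bootstrap sample with a random subset of
# features considered at every split; the forest combines the trees' predictions.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```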

K Nearest Neighbors (KNN) intuition

K Nearest Neighbor (KNN)

In k-NN regression, the output is the property value for the object: the average of the values of its k nearest neighbors.

What is KNN? Machine learning models use a set of input values to predict output values. KNN is one of the simplest machine learning algorithms, mostly used for classification. It classifies a data point based on how its neighbors are classified.

Introduction: The K Nearest Neighbor algorithm falls under the supervised learning category and is used for classification (most commonly) and regression. It is a versatile algorithm, also used for imputing missing values and resampling datasets. As the name (K Nearest Neighbor) suggests, it considers the K nearest neighbors (data points) to predict the class or continuous value for the new data point.

How to choose the value for K? Using error curves: the figure below shows error curves for different values of K for training and test data. At low K values, there is overfitting of data...
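A small scikit-learn sketch that mirrors the error-curve idea: it fits KNN for a few values of K and prints the training and test errors. The dataset and the K values are chosen only for illustration.

```python
# Illustrative sketch: KNN classification and a simple error curve over K.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Small K tends to overfit (low training error, higher test error);
# very large K underfits. The "elbow" of the test-error curve is a common choice.
for k in (1, 3, 5, 11, 21, 51):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err = 1 - knn.score(X_train, y_train)
    test_err = 1 - knn.score(X_test, y_test)
    print(f"K={k:>2}  train error={train_err:.3f}  test error={test_err:.3f}")
```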

Confusion Matrix using scikit-learn in Python

Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix

True Positive (TP): the predicted value matches the actual value; the actual value was positive and the model predicted a positive value.

True Negative (TN): the predicted value matches the actual value; the actual value was negative and the model predicted a negative value.

False Positive (FP): the predicted value does not match the actual value; the actual value was negative but the model predicted a positive value. Also known as a Type 1 error.

False Negative (FN): the predicted value does not match the actual value; the actual value was positive but the model predicted a negative value. Also known as a Type 2 error.

GITHUB Link
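A tiny scikit-learn example of reading these four counts out of a confusion matrix; the labels below are toy data, not taken from the linked notebook.

```python
# Illustrative sketch: extracting TP/TN/FP/FN with scikit-learn's confusion_matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (toy data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (toy data)

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp} (Type 1 error)  FN={fn} (Type 2 error)")
```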

What are the SQL constraints?

What are the SQL constraints?
Ans: SQL constraints are used to specify rules for the data in a table. Constraints limit the type of data that can go into a table (a short DDL sketch follows this list).
1. NOT NULL Constraint − ensures that a column cannot have a NULL value.
2. DEFAULT Constraint − provides a default value for a column when none is specified.
3. UNIQUE Constraint − ensures that all values in a column are different.
4. PRIMARY Key − uniquely identifies each row/record in a database table.
5. FOREIGN Key − references a row/record in another table, enforcing referential integrity between the two tables.
6. CHECK Constraint − ensures that all the values in a column satisfy certain conditions.
7. INDEX − used to create and retrieve data from the database very qu...
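A compact sketch of these constraints as DDL, run here through Python's sqlite3 for convenience; the table and column names are made up, and exact constraint syntax varies slightly between database engines.

```python
# Illustrative sketch: the constraints above expressed in CREATE TABLE statements.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (
    dept_id  INTEGER PRIMARY KEY                        -- PRIMARY KEY
);
CREATE TABLE employees (
    emp_id   INTEGER PRIMARY KEY,                       -- PRIMARY KEY
    name     TEXT NOT NULL,                             -- NOT NULL
    email    TEXT UNIQUE,                               -- UNIQUE
    salary   REAL DEFAULT 0 CHECK (salary >= 0),        -- DEFAULT + CHECK
    dept_id  INTEGER REFERENCES departments(dept_id)    -- FOREIGN KEY
);
CREATE INDEX idx_employees_name ON employees(name);     -- INDEX
""")
conn.close()
```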

Key Differences Between Primary key and Unique key

Key Differences Between Primary key and Unique key
1. When an attribute is declared as a primary key, it will not accept NULL values. On the other hand, when an attribute is declared as unique it can accept one NULL value (this is SQL Server behaviour; some engines allow multiple NULLs in a unique column).
2. A table can have only one primary key, whereas there can be multiple unique constraints on a table.
3. A clustered index is automatically created when a primary key is defined. In contrast, a unique key generates a non-clustered index.
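A minimal DDL sketch of point 2 (one primary key alongside several unique constraints), again via sqlite3 and with made-up column names.

```python
# Illustrative sketch: a single PRIMARY KEY plus multiple UNIQUE constraints.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,  -- only one primary key per table; never NULL
    email   TEXT UNIQUE,          -- first unique constraint
    phone   TEXT UNIQUE           -- a second unique constraint on the same table
)
""")
conn.close()
```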

Key Differences Between Fact Table and Dimension Table

Key Differences Between Fact Table and Dimension Table
1. A fact table contains measurements along the dimensions/attributes of a dimension table.
2. A fact table contains more records and fewer attributes compared to a dimension table, whereas a dimension table contains more attributes and fewer records.
3. The size of a fact table grows vertically (more rows), whereas the size of a dimension table grows horizontally (more attributes).
4. Each dimension table contains a primary key to identify each record in the table, whereas the fact table contains a concatenated key, which is a combination of the primary keys of all the dimension tables (see the sketch after this list).
5. The dimension tables have to be created before the fact table.
6. A schema contains fewer fact tables but more dimension tables.
7. Attributes in a fact table are numeric as well as textual, but attributes of a dimension table have textu...
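A hypothetical star-schema fragment illustrating point 4: dimension tables with their own primary keys, and a fact table keyed by their combination (names invented for illustration).

```python
# Illustrative sketch: dimension tables plus a fact table with a concatenated key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- The fact table holds numeric measures and grows vertically as new rows arrive.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units_sold  INTEGER,
    revenue     REAL,
    PRIMARY KEY (date_key, product_key)   -- concatenated key from the dimension keys
);
""")
conn.close()
```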

Differences Between Data Warehouse and Data Mart

Key Differences Between Data Warehouse and Data Mart
1. A data warehouse is application-independent, whereas a data mart is specific to a decision-support-system application.
2. The data is stored in a single, centralized repository in a data warehouse. In contrast, a data mart stores data decentrally in the user area.
3. A data warehouse contains a detailed form of data. In contrast, a data mart contains summarized and selected data.
4. The data in a data warehouse is slightly denormalized, while in the case of a data mart it is highly denormalized.
5. The construction of a data warehouse involves a top-down approach. Conversely, constructing a data mart uses a bottom-up approach.
6. A data warehouse is flexible, information-oriented, and longtime e...

Differences Between Star and Snowflake Schema

Key Differences Between Star and Snowflake Schema
1. A star schema contains just one dimension table for each dimension entry, while a snowflake schema may have a dimension table plus sub-dimension tables for one entry (see the sketch after this list).
2. Normalization is used in a snowflake schema, which eliminates data redundancy. In contrast, normalization is not performed in a star schema, which results in data redundancy.
3. A star schema is simple, easy to understand, and involves less intricate queries. On the contrary, a snowflake schema is harder to understand and involves complex queries.
4. The data model approach used in a star schema is top-down, whereas a snowflake schema uses bottom-up.
5. A star schema uses fewer joins. On the other hand, a snowflake schema uses a large number of joins.
6. The space consumed by a star schema is more compared to a snowflake schema.
7. The ...
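A hypothetical product dimension showing the structural difference: a single denormalized star-schema dimension versus a snowflaked dimension with a sub-dimension table (names invented for illustration).

```python
# Illustrative sketch: the same dimension modelled in star and snowflake styles.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star schema: one denormalized dimension table; the category name is repeated
-- on every product row (redundancy, but only one join to the fact table).
CREATE TABLE dim_product_star (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_name TEXT
);

-- Snowflake schema: the dimension is normalized into a sub-dimension table,
-- which removes the redundancy but adds an extra join at query time.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snowflake (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
""")
conn.close()
```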

Normalization vs Denormalization

Key Differences Between Normalization and Denormalization
1. Normalization is the technique of dividing the data into multiple tables to reduce data redundancy and inconsistency and to achieve data integrity. On the other hand, denormalization is the technique of combining the data into a single table to make data retrieval faster (a small sketch follows this list).
2. Normalization is used in OLTP systems, which emphasize making inserts, deletes, and updates faster. In contrast, denormalization is used in OLAP systems, which emphasize making search and analysis faster.
3. Data integrity is maintained in the normalization process, while with denormalization data integrity is harder to retain.
4. Redundant data is eliminated when normalization is performed, whereas denormalization increases the redundant data.
5. Normalization increases the number of tables and joins. In contrast, denormalization reduces the number of tables and joins.
6. ...
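A small sketch of point 1 with invented customer/order tables: the same data held denormalized in one wide table versus normalized into two related tables.

```python
# Illustrative sketch: denormalized versus normalized layouts of the same data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized: customer details repeated on every order row (faster reads, redundant data).
CREATE TABLE orders_denormalized (
    order_id      INTEGER PRIMARY KEY,
    customer_name TEXT,
    customer_city TEXT,
    amount        REAL
);

-- Normalized: customer details stored once and referenced by key (no redundancy,
-- but reads need a join between the two tables).
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount      REAL
);
""")
conn.close()
```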

Univariate, Bivariate and Multivariate Analysis in EDA

# Data science life cycle:
Every data science beginner, working professional, student, or practitioner follows a few steps while doing a project. I will tell you about all these steps in simple terms for your understanding.

# 1. Hypothesis definition:
A proposed explanation used as a starting point for further investigation. Ex: a company (A) wants to release a raincoat (product) in summer. Now the company is in a dilemma about whether to release the product or not. (I know it's a bad idea, but let's use it for understanding.)

# 2. Data acquisition:
Collecting the required data. Ex: collecting the last 10 years of data in a certain region.

# 3. Exploratory Data Analysis (EDA):
Analysing the collected data using some concepts (covered below, with a short sketch after this excerpt). Ex: on the collected (existing) data, data scientists will perform some analysis and decide which features/metrics to consider for model building.

# 4. Model building:
This is where machine learning comes into the picture. Ex: by using metrics (out...
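As a quick illustration of the EDA step named above, here is a small pandas/seaborn sketch of univariate, bivariate, and multivariate views; the iris dataset is used only as a stand-in for whatever data was acquired.

```python
# Illustrative EDA sketch: univariate, bivariate and multivariate views of a dataset.
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")        # stand-in dataset; any tabular data works

print(df.describe())                 # univariate: summary statistics per column
df["sepal_length"].hist()            # univariate: distribution of a single feature
plt.show()

sns.scatterplot(data=df, x="sepal_length", y="petal_length")   # bivariate relationship
plt.show()

sns.pairplot(df, hue="species")      # multivariate: all pairwise relationships at once
plt.show()
```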

Multiple Linear Regression

# What is multiple linear regression?
* Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.
* Multiple regression is an extension of simple linear (OLS) regression, which uses just one explanatory variable.
* MLR is used extensively in econometrics and financial inference.
* Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how the dependent variable changes as the independent variable(s) change.
* Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:
* How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect c...
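A minimal scikit-learn sketch of fitting such a model; the rainfall/fertilizer numbers below are invented purely to show the mechanics, not real data.

```python
# Illustrative sketch: multiple linear regression with two explanatory variables.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: predict crop yield from rainfall (mm) and fertilizer (kg) -- invented values.
X = np.array([[100, 20], [120, 25], [90, 18], [150, 30], [110, 22], [130, 28]])
y = np.array([3.1, 3.8, 2.9, 4.6, 3.4, 4.1])

model = LinearRegression().fit(X, y)
print("Intercept:", model.intercept_)
print("Coefficients (rainfall, fertilizer):", model.coef_)
print("R^2 on the training data:", model.score(X, y))
```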