Posts

RDD vs Dataframe vs Dataset

RDD, DataFrame, and Dataset are all data abstraction APIs provided by Apache Spark for data processing and analytics. Functionally they are equivalent: for a given input, each produces the same output. They differ in how they represent and process data, and they vary in performance, user convenience, and language support. Users can choose any of the three APIs when working with Spark. 1) RDD - RDD stands for Resilient Distributed Dataset. An RDD is an immutable, distributed collection of records partitioned across the nodes of a cluster; a lost partition can be recomputed from its lineage, which provides fault tolerance. RDDs are Spark's fundamental data structure and expose a low-level API for performing distributed data processing tasks. Resilient - RDDs are immutable, partitioned collections of records that can be recovered if a partition is lost. Distributed - RDDs are distributed across the cluster nodes to allow parallel processing. In-built...
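The "resilient" part comes from lineage: Spark does not replicate partition data; it records the transformation that produced each partition and replays it if a partition is lost. A toy pure-Python sketch of that idea (this is an illustration only, not Spark's actual implementation; the `ToyRDD` class and its names are made up for the example):

```python
class ToyRDD:
    """Toy model of an RDD: immutable partitions plus the lineage to rebuild them."""

    def __init__(self, partitions, lineage=None):
        self.partitions = [list(p) for p in partitions]  # the "distributed" data
        self.lineage = lineage                           # (parent, fn) used to recompute

    def map(self, fn):
        # A transformation returns a NEW ToyRDD; the parent is never mutated.
        return ToyRDD([[fn(x) for x in p] for p in self.partitions],
                      lineage=(self, fn))

    def recover_partition(self, i):
        # Fault tolerance: recompute only the lost partition from the parent's data.
        parent, fn = self.lineage
        self.partitions[i] = [fn(x) for x in parent.partitions[i]]


base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)

doubled.partitions[1] = None     # simulate losing one partition
doubled.recover_partition(1)     # rebuild it from lineage
print(doubled.partitions)        # [[2, 4], [6, 8]]
```

Note that `base` is untouched throughout: transformations build a new dataset with a pointer back to the old one, which is exactly what makes partition-level recovery possible.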

𝗦𝗽𝗮𝗿𝗸 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀: 𝗡𝗮𝗿𝗿𝗼𝘄 𝘃𝘀. 𝗪𝗶𝗱𝗲


🚀 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴 𝗦𝗽𝗮𝗿𝗸 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀: 𝗡𝗮𝗿𝗿𝗼𝘄 𝘃𝘀. 𝗪𝗶𝗱𝗲 🚀
📌 In Apache Spark, the difference between narrow and wide transformations is crucial for optimizing performance.
🔹 𝗡𝗮𝗿𝗿𝗼𝘄 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀:
👉🏻 Include operations like map, flatMap, filter, union, and coalesce.
👉🏻 Each output partition depends on a single input partition (a one-to-one mapping).
👉🏻 Execute within a single stage, without shuffling or moving data across partitions.
👉🏻 Each input partition contributes to only one output partition, which makes them cheaper to run.
👉🏻 Spark handles the mapping between file blocks and RDD partitions internally.
🔹 𝗪𝗶𝗱𝗲 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀:
👉🏻 Involve operations like groupByKey(), reduceByKey(), join(), cogroup(), distinct(), and repartition().
👉🏻 Allow each input partition to contribute to multiple output partitions.
👉🏻 Require data shuffling and movement across partitions, often ...
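The practical difference can be simulated without Spark at all: a narrow transformation works on each partition in isolation, while a wide one must first redistribute records by key across partitions (the shuffle). A pure-Python sketch of the idea (the partition layout and the `shuffle` helper are made up for this illustration, not Spark APIs):

```python
from collections import defaultdict

# Two "partitions" of (key, value) records, as a list of lists
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow (map-like): each partition is transformed independently -- no data movement
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide (reduceByKey-like): all values for a key must land in ONE partition,
# so records are redistributed by hash of key before the reduce can run
def shuffle(parts, num_out=2):
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            out[hash(k) % num_out][k].append(v)  # records cross partition boundaries here
    return out

shuffled = shuffle(mapped)
reduced = [{k: sum(vs) for k, vs in p.items()} for p in shuffled]
```

The narrow step never looked outside its own partition; the wide step could not even start until every record with the same key had been moved to the same place. That data movement is what makes wide transformations stage boundaries in Spark.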

𝗣𝗿𝗼𝗱𝘂𝗰𝘁 𝗖𝗮𝘁𝗲𝗴𝗼𝗿𝘆-Pandas-Easy

𝗣𝗿𝗼𝗱𝘂𝗰𝘁 𝗖𝗮𝘁𝗲𝗴𝗼𝗿𝘆
🔶 You are provided with a table named Products containing information about various products, including their names and prices. Write Pandas code to count the number of products in each of the three price categories below. Display the output in descending order of the number of products.
🔶 1- "Low Price" for products with a price less than 100
🔶 2- "Medium Price" for products with a price between 100 and 500 (inclusive)
🔶 3- "High Price" for products with a price greater than 500.

import pandas as pd
import numpy as np

# Assign each product to a price band
products_df["category"] = np.where(
    products_df["price"] < 100, "Low Price",
    np.where(
        products_df["price"].between(100, 500, inclusive="both"), "Medium Price",
        np.where(products_df["price"] > 500, "High Price", "Unknown")
    )
)

# Count products per category, in descending order of count
output_df = (
    products_df.groupby("category")
    .size()
    .reset_index(name="no_of_products")
    .sort_values("no_of_products", ascending=False)
)
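When there are several bands, nested np.where calls get hard to read; np.select takes a flat list of conditions and labels instead, and value_counts already returns counts sorted in descending order. A sketch with made-up sample data (the `products_df` contents here are an assumption standing in for the Products table):

```python
import pandas as pd
import numpy as np

# Hypothetical sample rows standing in for the Products table
products_df = pd.DataFrame({
    "name": ["mouse", "keyboard", "monitor", "laptop", "cable"],
    "price": [25, 100, 500, 1200, 8],
})

# One condition per band, evaluated in order; first match wins
conditions = [
    products_df["price"] < 100,
    products_df["price"].between(100, 500, inclusive="both"),
    products_df["price"] > 500,
]
labels = ["Low Price", "Medium Price", "High Price"]
products_df["category"] = np.select(conditions, labels, default="Unknown")

# value_counts sorts by count in descending order by default
counts = products_df["category"].value_counts()
print(counts)
```

With this sample data, "Low Price" and "Medium Price" each get 2 products and "High Price" gets 1; the boundary values 100 and 500 land in "Medium Price" because `between` is inclusive on both ends.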

𝗘𝗹𝗲𝗰𝘁𝗿𝗶𝗰𝗶𝘁𝘆 𝗖𝗼𝗻𝘀𝘂𝗺𝗽𝘁𝗶𝗼𝗻-Pandas-Easy

𝗘𝗹𝗲𝗰𝘁𝗿𝗶𝗰𝗶𝘁𝘆 𝗖𝗼𝗻𝘀𝘂𝗺𝗽𝘁𝗶𝗼𝗻
🔶 You have access to data from an electricity billing system, detailing the electricity usage and cost of specific households over billing periods in the years 2023 and 2024. Your objective is to present the total electricity consumption, total cost, and average monthly consumption for each household per year. Display the output in ascending order of household id and bill year.
📌 𝗣𝗮𝗻𝗱𝗮𝘀 𝗰𝗼𝗱𝗲:
🔶 𝗔𝗽𝗽𝗿𝗼𝗮𝗰𝗵-𝟭-

import pandas as pd

# Converting billing_period to datetime
electricity_bill_df["billing_period"] = pd.to_datetime(electricity_bill_df["billing_period"])

# Extracting the year of each bill
electricity_bill_df["bill_year"] = electricity_bill_df["billing_period"].dt.year

# Aggregating the data (each billing row is treated as one month)
electricity_bill_df_ans = (
    electricity_bill_df
    .groupby(["household_id", "bill_year"], as_index=False)
    .agg(
        total_cost=("total_cost", "sum"),
        total_consumption_kwh=("consumption_kwh", "sum"),
        avg_monthly_consumption=("consumption_kwh", "mean"),
    )
    .sort_values(["household_id", "bill_year"])
)
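A self-contained, runnable version of the same groupby-and-aggregate approach, with a small made-up dataset (the sample rows are an assumption, and treating each billing row as one month for the average is also an assumption):

```python
import pandas as pd

# Hypothetical sample bills standing in for the billing-system data
electricity_bill_df = pd.DataFrame({
    "household_id": [2, 1, 1, 2],
    "billing_period": ["2023-01-01", "2023-01-01", "2023-02-01", "2024-01-01"],
    "consumption_kwh": [300, 100, 200, 400],
    "total_cost": [60.0, 20.0, 40.0, 90.0],
})

# Parse the billing period and pull out the year
electricity_bill_df["billing_period"] = pd.to_datetime(electricity_bill_df["billing_period"])
electricity_bill_df["bill_year"] = electricity_bill_df["billing_period"].dt.year

# One row per (household, year): totals plus per-bill (monthly) average
ans = (
    electricity_bill_df
    .groupby(["household_id", "bill_year"], as_index=False)
    .agg(
        total_consumption_kwh=("consumption_kwh", "sum"),
        total_cost=("total_cost", "sum"),
        avg_monthly_consumption=("consumption_kwh", "mean"),
    )
    .sort_values(["household_id", "bill_year"])  # ascending by default
)
print(ans)
```

Household 1 has two 2023 bills (100 and 200 kWh), so it aggregates to a 300 kWh total and a 150 kWh monthly average; each single-bill group simply echoes its own values.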