Thursday, October 21, 2021

BIG DATA & HADOOP Interview Questions



Question: What is MapReduce in Hadoop?

Ans:
Ø MapReduce is a processing technique and a programming model for distributed computing, based on Java.
Ø The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Ø Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
Ø Reduce takes the output of Map as its input and combines those tuples into a smaller set of tuples.
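
As a rough illustration of the two tasks, here is a minimal word-count sketch in plain Python. It is not the Hadoop API; the function names (map_phase, reduce_phase, run_job) are made up for this example, and the grouping step that Hadoop performs across the cluster is simulated with an in-memory dictionary.

```python
from collections import defaultdict

def map_phase(line):
    """Map: break one input line into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single pair."""
    return (key, sum(values))

def run_job(lines):
    # Group mapper output by key. In Hadoop the framework does this
    # between the map and reduce tasks; here it is a local dict.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    # Call the reducer once per key, in sorted key order.
    return [reduce_phase(key, grouped[key]) for key in sorted(grouped)]

if __name__ == "__main__":
    print(run_job(["big data hadoop", "hadoop big data big"]))
```

The mapper only ever sees one line and the reducer only ever sees one key, which is what lets a real cluster run many of each in parallel.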

Question: What is MapReduce Shuffling and Sorting in Hadoop?
Ans:
Ø Shuffling is the process that transfers the mappers' intermediate output to the reducers.
Ø Each reducer gets one or more keys and their associated values; which keys it gets depends on the number of reducers.
Ø The intermediate key-value pairs generated by the mappers are sorted automatically by key. In the sort phase, the map outputs are merged and sorted.
Ø Shuffling and sorting in Hadoop occur simultaneously.
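
The points above can be sketched in plain Python. This is only a toy model under stated assumptions: the partition function below is a hypothetical stand-in for Hadoop's partitioner, and the "network transfer" is just moving lists around in memory. It shows each mapper's output being split per reducer and sorted by key, then each reducer merging the sorted pieces it receives.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def partition(key, num_reducers):
    # Hypothetical partitioner for this sketch: a stable hash of the key,
    # reduced to a reducer index. Not Hadoop's own hash function.
    return sum(key.encode("utf-8")) % num_reducers

def shuffle_and_sort(mapper_outputs, num_reducers):
    # Map side: split each mapper's output per reducer, sort each piece by key.
    runs = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        buckets = [[] for _ in range(num_reducers)]
        for key, value in output:
            buckets[partition(key, num_reducers)].append((key, value))
        for reducer_id, bucket in enumerate(buckets):
            runs[reducer_id].append(sorted(bucket, key=itemgetter(0)))
    # Reduce side: merge the sorted pieces and group the values by key.
    reducer_inputs = []
    for reducer_runs in runs:
        merged = heapq.merge(*reducer_runs, key=itemgetter(0))
        reducer_inputs.append(
            [(key, [v for _, v in group])
             for key, group in groupby(merged, key=itemgetter(0))]
        )
    return reducer_inputs

if __name__ == "__main__":
    outputs = [[("b", 1), ("a", 1)], [("a", 1), ("c", 1)]]
    for reducer_id, data in enumerate(shuffle_and_sort(outputs, 2)):
        print(reducer_id, data)
```

Because every piece arrives already sorted, the reducer side only needs a merge rather than a full sort, which is why the two phases can overlap.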



Question: How many Reducers run for a MapReduce job?
Ans:
1 - The number of reducers equals the number of partitions: False. A single reducer might work on one or more partitions, but a given partition is processed entirely by the reducer it is assigned to.
2 - That is just a theoretical maximum number of reducers you can configure for a Hadoop cluster. It also depends heavily on the kind of data you are processing, which decides how much heavy lifting the reducers are burdened with.
3 - The mapred-site.xml configuration is only a request passed to YARN. Internally, the ResourceManager runs its own scheduling algorithm and optimizes things on the go, so that value is not necessarily the number of reducer tasks running at any given moment.
4 - This one seems a bit unrealistic. My block size might be 128 MB, and I can't have a minimum of 128*5 reducers every time. That, again, is false, I believe.
There is no single fixed number of reducer tasks that can be configured or calculated for every job. How many actually run at a given moment depends on the resources available to allocate.
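
To make the last point concrete, here is a deliberately simplified arithmetic sketch in Python. It is not YARN's scheduling algorithm; the function name and both parameters are assumptions made up for this example. It only shows that the number of reducers requested and the number that can run at once are different quantities when containers are scarce.

```python
import math

def reducer_waves(requested_reducers, free_containers):
    """Toy model: (reducers running at once, number of waves needed)."""
    if requested_reducers == 0:
        return 0, 0  # a map-only job has no reduce phase
    if free_containers <= 0:
        raise ValueError("no containers available for reducers")
    # At most one reducer per free container can run at the same time.
    running_at_once = min(requested_reducers, free_containers)
    # The remaining reducers wait and run in later waves.
    waves = math.ceil(requested_reducers / running_at_once)
    return running_at_once, waves

if __name__ == "__main__":
    print(reducer_waves(10, 4))  # 10 requested, only 4 containers free
    print(reducer_waves(3, 8))   # fewer reducers than free containers
```

With 10 reducers requested and 4 free containers, this model runs 4 at a time over 3 waves; with 3 requested and 8 free, all 3 run in a single wave.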


Question: What is HIVE?

Ans:
•HIVE is a query interface on top of Hadoop's native MapReduce

•HIVE is a data warehouse

•HIVE allows users to write SQL-style queries in its own language, known as Hive Query Language (HQL)

•The HIVE execution engine converts scripts written in HQL into MapReduce jobs (packaged as JAR files) that execute on the cluster

•HIVE reads data from HDFS

•Allows creation of tables to operate on structured data

•The table's schema information (table metadata) is saved in the HIVE metastore, which is backed by an RDBMS (Derby is the default database)

•HIVE is not an RDBMS 
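
To give a feel for what "converting HQL into MapReduce" means, here is a conceptual sketch in plain Python. The table employees(name, dept), its rows, and the function names are all hypothetical, and the plan that Hive actually generates is more involved than this. The sketch only shows how a query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept breaks down into a map step and a reduce step.

```python
from collections import defaultdict

# Hypothetical rows of a table employees(name, dept), standing in for
# records that would be read from files in HDFS.
ROWS = [("asha", "sales"), ("ben", "ops"), ("chen", "sales")]

def map_row(row):
    # Map: emit the GROUP BY column as the key and 1 as the value.
    _name, dept = row
    return (dept, 1)

def reduce_group(dept, ones):
    # Reduce: COUNT(*) for one group is the sum of its 1s.
    return (dept, sum(ones))

def run_query(rows):
    # Group the mapper output by key, then reduce each group.
    groups = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        groups[key].append(value)
    return [reduce_group(key, groups[key]) for key in sorted(groups)]

if __name__ == "__main__":
    print(run_query(ROWS))
```

The user writes only the SQL-style query; the map and reduce functions are the kind of work the execution engine takes on behind it.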
