Hadoop Screening Q&A

Senthil Nayagan
Sep 12, 2018 · 4 min read

Hadoop

What is Edge node?
In a Hadoop cluster, three types of nodes exist: master, worker and edge nodes.

Edge nodes are the interface between the Hadoop cluster and the outside network.

They usually act as gateways through which end users reach the cluster, hosting client-facing services and interfaces such as HiveServer2, the Impala load balancer (a proxy for the Impala daemons), Flume agents, client configuration files, and web interfaces like HttpFS, Oozie servers and Hue servers.

What is Speculative Execution in Hadoop?
MapReduce breaks jobs into tasks, and these tasks run in parallel.

If some of the launched tasks run slower than expected, Hadoop doesn’t try to diagnose and fix the slow-running tasks; instead, it detects them and launches backup tasks for them.

Whichever of the duplicate tasks (the original slow task or its backup) finishes first is the one whose output is used in further operations; the other attempt is killed.
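Speculative execution is enabled by default and can be toggled per job through configuration. Below is a minimal sketch, assuming a ToolRunner-based job so the -D generic options are honoured (the jar, class and paths are placeholders):

    # Disable speculative execution for both map and reduce tasks of one job
    hadoop jar my-app.jar com.example.MyJob \
        -D mapreduce.map.speculative=false \
        -D mapreduce.reduce.speculative=false \
        /input /output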

What is high availability in Hadoop?
The HDFS High Availability feature addresses the NameNode single point of failure by providing the option of running two redundant NameNodes in the same cluster, one Active and one Standby.

This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
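The hdfs haadmin utility can be used to inspect the NameNode roles and to trigger a manual failover; a minimal sketch, assuming the two NameNodes are registered as nn1 and nn2 in hdfs-site.xml:

    # Check which NameNode is currently active
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # Graceful, administrator-initiated failover from nn1 to nn2
    hdfs haadmin -failover nn1 nn2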

Sqoop

Using Sqoop, how will you ingest data from an RDBMS table without a primary key?
Either set the number of mappers to 1 (-m 1 / --num-mappers 1) or supply an explicit split column with the --split-by option.
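A minimal sketch of both approaches (the connection string, credentials, table and column names are placeholders):

    # Option 1: fall back to a single mapper, so no split column is needed
    sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username etl -P \
        --table orders_no_pk \
        --num-mappers 1

    # Option 2: tell Sqoop explicitly how to split the data
    sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username etl -P \
        --table orders_no_pk \
        --split-by order_id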

How does one decide which column to use for the --split-by option?
Choose a column whose values are evenly distributed and, ideally, unique, such as a sequential numeric key; a skewed or low-cardinality column leads to unbalanced splits, leaving some mappers overloaded and others idle.

What are the types of incremental imports in Sqoop?
append and lastmodified, selected with --incremental append or --incremental lastmodified.

What is the significance of “check-column” in Sqoop import?
It names the column Sqoop examines to decide which rows to import, and it is a mandatory option for incremental imports.

How does one decide which column to use as the --check-column for an incremental append import?
For incremental append, the import is driven by --last-value: rows whose check column is greater than that value are imported, so the column should be a numeric column whose values only grow as rows are added, typically an auto-increment primary key.
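A minimal sketch, assuming a hypothetical orders table with an auto-increment id column and a previous high-water mark of 1000:

    sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username etl -P \
        --table orders \
        --incremental append \
        --check-column id \
        --last-value 1000 \
        --target-dir /data/sales/orders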

How does one decide which column to use as the --check-column for an incremental lastmodified import?
For incremental lastmodified, the import is based on an audit column that records when a row was last updated, for instance a last_updated_timestamp or last_updated_date column; rows whose value is newer than --last-value are imported.
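A minimal sketch, assuming a hypothetical last_updated audit column maintained by the source application; --merge-key lets Sqoop reconcile updated rows with those already in the target directory:

    sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username etl -P \
        --table orders \
        --incremental lastmodified \
        --check-column last_updated \
        --last-value "2018-09-01 00:00:00" \
        --merge-key id \
        --target-dir /data/sales/orders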

Give a situation where the --autoreset-to-one-mapper option is useful.
It is useful with sqoop import-all-tables: tables that have a primary key are imported with the requested number of mappers, while tables without one automatically fall back to a single mapper instead of failing the ingestion.
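A minimal sketch (connection details and the warehouse directory are placeholders):

    sqoop import-all-tables \
        --connect jdbc:mysql://db.example.com/sales \
        --username etl -P \
        --num-mappers 4 \
        --autoreset-to-one-mapper \
        --warehouse-dir /data/sales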

What is the significance of “direct” option in Sqoop import?
The --direct option improves ingestion performance by using the database’s native bulk tools (for example, mysqldump for MySQL) when a direct connector is available for that database.
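For example, assuming a MySQL source with the mysqldump client available on the cluster nodes:

    sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username etl -P \
        --table orders \
        --direct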

During Sqoop export, will the data in the HDFS get inserted or updated in the RDBMS table?
By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table.
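To update existing rows instead, Sqoop offers --update-key and, where the connector supports it, --update-mode allowinsert for upserts; a minimal sketch with placeholder connection details:

    # Default behaviour: every record becomes an INSERT
    sqoop export \
        --connect jdbc:mysql://db.example.com/sales \
        --username etl -P \
        --table orders \
        --export-dir /data/sales/orders

    # Update rows matching on "id"; allowinsert turns this into an upsert
    sqoop export \
        --connect jdbc:mysql://db.example.com/sales \
        --username etl -P \
        --table orders \
        --export-dir /data/sales/orders \
        --update-key id \
        --update-mode allowinsert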

What is the significance of using “staging-table” option in Sqoop export?
Since Sqoop breaks down export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database.

We can overcome this problem by specifying a staging table via the “staging-table” option which acts as an auxiliary table that is used to stage exported data — the staged data is finally moved to the destination table in a single transaction.
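A minimal sketch, assuming a hypothetical orders_stage table with the same schema as the target already exists in the database:

    sqoop export \
        --connect jdbc:mysql://db.example.com/sales \
        --username etl -P \
        --table orders \
        --staging-table orders_stage \
        --clear-staging-table \
        --export-dir /data/sales/orders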

Hive & Impala

What is the significance of an external table in Hive?
Dropping (or truncating) an external table does not delete the underlying data: DROP removes only the table’s metadata from the metastore and leaves the data files in place, and TRUNCATE is not permitted on external tables.
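A minimal sketch in HiveQL, run here through the hive CLI (database, table and path names are placeholders):

    hive -e "
      CREATE EXTERNAL TABLE sales.orders_ext (
        id       BIGINT,
        amount   DOUBLE,
        order_dt STRING
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/sales/orders';

      -- Removes only the metastore entry; files under /data/sales/orders remain
      DROP TABLE sales.orders_ext;
    "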

When is an external table preferred?
When the underlying data is created or managed by multiple teams or applications.

Since the same data is shared by multiple teams or applications, none of them needs to maintain a separate copy of it.

What is the default location for managed table?
In an HDFS directory — /user/hive/warehouse.

Can we establish relationships between tables using primary/foreign keys in Hive?
No. Hive is not a replacement for traditional RDBMS and it does not validate primary and foreign key constraints.

However, newer versions of Hive (2.1 and later) include support for non-validated primary and foreign key constraints.
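A minimal sketch of that syntax; the constraints are recorded in the metastore but not enforced (table and column names are placeholders):

    hive -e "
      CREATE TABLE depts (
        dept_id INT,
        name    STRING,
        PRIMARY KEY (dept_id) DISABLE NOVALIDATE
      );

      CREATE TABLE employees (
        emp_id  INT,
        dept_id INT,
        CONSTRAINT fk_dept FOREIGN KEY (dept_id)
          REFERENCES depts (dept_id) DISABLE NOVALIDATE
      );
    "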

Is MapReduce required for Impala? Will Impala continue to work as expected if MapReduce is stopped?
Impala does not use MapReduce at all; it is an MPP (massively parallel processing) engine, so it continues to work even if the MapReduce framework is stopped.

What is the purpose of Hive partitioning, and on what basis is the partition column selected?
Partitioning lets queries that filter on the partition column read only the relevant partitions instead of scanning the full table, which improves query performance.

A column with relatively low cardinality that appears frequently in filter predicates (such as a date or region column) is usually preferred, so the table is not fragmented into a huge number of tiny partitions.
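A minimal sketch of a table partitioned by date; a query filtering on the partition column reads only the matching partition directories (names are placeholders):

    hive -e "
      CREATE TABLE sales.orders_part (
        id     BIGINT,
        amount DOUBLE
      )
      PARTITIONED BY (order_dt STRING);

      -- Only the 2018-09-12 partition is scanned
      SELECT SUM(amount)
      FROM sales.orders_part
      WHERE order_dt = '2018-09-12';
    "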

Difference between INVALIDATE METADATA and REFRESH?
INVALIDATE METADATA discards the cached metadata and forces a full reload of all the metadata for the table the next time it is referenced, which can be an expensive operation, especially for large tables with many partitions.

REFRESH incrementally reloads the metadata. Note that it immediately loads the block location data for newly added data files, making it a less expensive operation overall.
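Both statements are issued from Impala, for example via impala-shell (table names are placeholders):

    # New data files were added to an existing table (e.g. by Hive or Sqoop)
    impala-shell -q "REFRESH sales.orders"

    # The table was created, dropped or heavily altered outside Impala
    impala-shell -q "INVALIDATE METADATA sales.orders"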

How can we update the existing data in Hive or Impala?
Use INSERT OVERWRITE to overwrite any existing data in the table or partition.
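A minimal sketch using Hive, rewriting a single partition of the partitioned table sketched above from a hypothetical sales.orders_staging table that holds the corrected rows:

    hive -e "
      INSERT OVERWRITE TABLE sales.orders_part PARTITION (order_dt = '2018-09-12')
      SELECT id, amount
      FROM sales.orders_staging
      WHERE order_dt = '2018-09-12';
    "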

Oozie

What actions can be performed in Oozie?
Email Action
Hive Action
Shell Action
Ssh Action
Sqoop Action
Custom actions (by writing a custom ActionExecutor)

What are the different states of an Apache Oozie Workflow job?
PREP
RUNNING
SUSPENDED
SUCCEEDED
KILLED
FAILED
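The current state is shown in the Status field of the Oozie CLI’s job info output; a sketch with a placeholder Oozie URL and job id:

    oozie job -oozie http://oozie-host:11000/oozie -info 0000012-180912123456789-oozie-oozi-W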

What is Workflow, Coordinator and Bundle in Oozie?
Workflow — Represents a sequence of actions to be executed.
Coordinator — Consists of workflow jobs triggered by time and data availability.
Bundle — A package of multiple coordinator jobs (and, through them, their workflow jobs) that can be started, suspended and managed together.
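All three are submitted the same way; which one runs is decided by the application-path property set in job.properties. A minimal sketch with placeholder paths:

    # job.properties points at the application on HDFS:
    #   oozie.wf.application.path     -> a workflow
    #   oozie.coord.application.path  -> a coordinator
    #   oozie.bundle.application.path -> a bundle
    oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run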

