Download Cloudera Certified Developer for Apache Hadoop (CCDH).CCD-410.ExamsKey.2018-09-15.36q.vcex

Vendor: Cloudera
Exam Code: CCD-410
Exam Name: Cloudera Certified Developer for Apache Hadoop (CCDH)
Date: Sep 15, 2018
File Size: 44 KB
Downloads: 1

How to open VCEX files?

Files with VCEX extension can be opened by ProfExam Simulator.

Demo Questions

Question 1
Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and monitoring application resource usage?
  1. ResourceManager
  2. NodeManager
  3. ApplicationMaster
  4. ApplicationMasterService
  5. TaskTracker
  6. JobTracker
Correct answer: B
Explanation:
The NodeManager is the per-machine YARN agent responsible for launching application containers and for monitoring their resource usage (CPU, memory, disk, network), which it reports back to the ResourceManager. The fundamental idea of MRv2 (YARN) is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. 
Note: Let's walk through an application execution sequence (a minimal client-side sketch of steps 1 and 2 follows the reference below):
  1. A client program submits the application, including the necessary specifications to launch the application-specific ApplicationMaster itself. 
  2. The ResourceManager assumes the responsibility to negotiate a specified container in which to start the ApplicationMaster and then launches the ApplicationMaster. 
  3. The ApplicationMaster, on boot-up, registers with the ResourceManager; the registration allows the client program to query the ResourceManager for details, which allow it to communicate directly with its own ApplicationMaster. 
  4. During normal operation the ApplicationMaster negotiates appropriate resource containers via the resource-request protocol. 
  5. On successful container allocations, the ApplicationMaster launches the container by providing the container launch specification to the NodeManager. The launch specification, typically, includes the necessary information to allow the container to communicate with the ApplicationMaster itself. 
  6. The application code executing within the container then provides necessary information (progress, status etc.) to its ApplicationMaster via an application-specific protocol. 
  7. During the application execution, the client that submitted the program communicates directly with the ApplicationMaster to get status, progress updates etc. via an application-specific protocol. 
  8. Once the application is complete, and all necessary work has been finished, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed. 
Reference: Apache Hadoop YARN – Concepts & Applications
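To make steps 1 and 2 concrete, here is a minimal client-side sketch using the YARN client API (YarnClient, from hadoop-yarn-client); the application name, AM launch command and container sizes are placeholder assumptions, not part of the question:

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Step 1: ask the ResourceManager for a new application and describe the AM to launch.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app");

    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("./run_application_master.sh")); // placeholder command
    appContext.setAMContainerSpec(amContainer);

    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(512);     // MB requested for the AM container
    capability.setVirtualCores(1);
    appContext.setResource(capability);

    // Step 2: the ResourceManager negotiates a container and launches the AM in it.
    ApplicationId appId = yarnClient.submitApplication(appContext);
    System.out.println("Submitted application " + appId);
  }
}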
Question 2
Which best describes how TextInputFormat processes input files and line breaks?
  1. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
  2. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
  3. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
  4. Input file splits may cross line breaks. A line that crosses file splits is ignored.
  5. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
Correct answer: A
Explanation:
Because the Map operation is parallelized, the input file set is first split into several pieces called FileSplits. If an individual file is so large that it would affect seek time, it is itself split into several FileSplits. The splitting knows nothing about the input file's internal logical structure; for example, line-oriented text files are split on arbitrary byte boundaries. A new map task is then created per FileSplit. 
When an individual map task starts, it opens a new output writer per configured reduce task and then reads its FileSplit using the RecordReader it gets from the specified InputFormat. The InputFormat parses the input and generates key-value pairs, and it must also handle records that are split across a FileSplit boundary. For example, TextInputFormat reads the last line of a FileSplit past the split boundary, and when reading any FileSplit other than the first it ignores the content up to the first newline. 
Reference: How Map and Reduce operations are actually carried out
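As a minimal illustration (assuming the standard new-API TextInputFormat), the mapper below simply echoes each line it receives; the class name is hypothetical, and the comments describe the split-boundary behaviour explained above:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat the key is the byte offset of the line within the file
// and the value is the complete line. A line that straddles a split boundary
// is delivered whole by the RecordReader of the split containing its start;
// the next split's RecordReader skips everything up to its first newline.
public class EchoLineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    context.write(offset, line);   // every call sees one complete line
  }
}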
Question 3
For each input key-value pair, mappers can emit:
  1. As many intermediate key-value pairs as designed. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
  2. As many intermediate key-value pairs as designed, but they cannot be of the same type as the input key-value pair.
  3. One intermediate key-value pair, of a different type.
  4. One intermediate key-value pair, but of the same type.
  5. As many intermediate key-value pairs as designed, as long as all the keys have the same type and all the values have the same type.
Correct answer: E
Explanation:
Mapper maps input key/value pairs to a set of intermediate key/value pairs. 
Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs. 
Reference: Hadoop Map-Reduce Tutorial
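A minimal sketch of this behaviour (a hypothetical tokenizing mapper, in the spirit of word count): one input pair can yield zero or many intermediate pairs, all sharing the same key type (Text) and value type (IntWritable):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One input pair (offset, line) may produce as many intermediate pairs as
// there are tokens in the line, but every emitted key is a Text and every
// emitted value an IntWritable, matching the declared output types.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // one intermediate pair per token
    }
  }
}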
Question 4
You have the following key-value pairs as output from your Map task:
(the, 1) 
(fox, 1) 
(faster, 1) 
(than, 1) 
(the, 1) 
(dog, 1) 
How many keys will be passed to the Reducer’s reduce method?
  1. Six
  2. Five
  3. Four
  4. Two
  5. One
  6. Three
Correct answer: B
Explanation:
The shuffle groups intermediate values by key before reduce() is called, so the two (the, 1) pairs are delivered together under a single key. The reducer therefore sees five distinct keys: the, fox, faster, than, and dog.
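A sketch of the reduce side for this example (the class name is hypothetical); the comment spells out the five reduce() invocations the sample data would produce:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For the sample map output above, the shuffle groups values by key, so
// reduce() is called five times, in sorted key order:
//   ("dog",[1]) ("faster",[1]) ("fox",[1]) ("than",[1]) ("the",[1,1])
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}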
Question 5
You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?
  1. HDFS command
  2. Pig LOAD command
  3. Sqoop import
  4. Hive LOAD DATA command
  5. Ingest with Flume agents
  6. Ingest with Hadoop Streaming
Correct answer: C
Explanation:
Sqoop is designed for exactly this task: importing records from a relational (OLTP) database into HDFS, where they can then be joined with the web logs that have already been ingested. Pig's LOAD command, by contrast, only reads data that is already in the Hadoop file system. 
The notes below, from a Pig-based web-log analysis pipeline, show how the log side of such a workflow is handled (a sketch of the Sqoop import itself follows the notes). For example, a log file is loaded into Pig using the LOAD command:
raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray);
Note 1:
Data Flow and Components 
* Content will be created by multiple Web servers and logged on local hard disks. This content will then be pushed to HDFS using the Flume framework. Flume has agents running on the Web servers; these machines collect data intermediately using collectors and finally push that data to HDFS. 
* Pig Scripts are scheduled to run using a job scheduler (could be cron or any sophisticated batch job solution). These scripts actually analyze the logs on various dimensions and extract the results. Results from Pig are by default inserted into HDFS, but we can use storage implementation for other repositories also such as HBase, MongoDB, etc. We have also tried the solution with HBase (please see the implementation section). Pig Scripts can either push this data to HDFS and then MR jobs will be required to read and push this data into HBase, or Pig scripts can push this data into HBase directly. In this article, we use scripts to push data onto HDFS, as we are showcasing the Pig framework applicability for log analysis at large scale. 
* The database HBase will have the data processed by Pig scripts ready for reporting and further slicing and dicing. 
* The data-access Web service is a REST-based service that eases the access and integrations with data clients. The client can be in any language to access REST-based API. These clients could be BI- or UI-based clients. 
Note 2:
The Log Analysis Software Stack 
* Hadoop is an open source framework that allows users to process very large data sets in parallel. It is based on the framework that underpins the Google search engine. The Hadoop core is mainly divided into two modules:
1. HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using multiple commodity servers connected in a cluster. 
2. Map-Reduce (MR) is a framework for parallel processing of large data sets. The default implementation is tied to HDFS. 
* The database can be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, since we can keep historical processed data for reporting purposes. HBase is an open source columnar (NoSQL) database that uses HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets -- HBase can store very large tables with millions of rows. It is a distributed database and can also keep multiple versions of a single row. 
* The Pig framework is an open source platform for analyzing large data sets and is implemented as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of developers who write code in the Map-Reduce format, since code in Map-Reduce format needs to be written in Java. In contrast, Pig enables users to write code in a scripting language. 
* Flume is a distributed, reliable and available service for collecting, aggregating and moving large amounts of log data (src: flume-wiki). It was built to push large logs into Hadoop-HDFS for further processing. It is a data-flow solution in which each node has an originator and a destination, and it is divided into Agent and Collector tiers for collecting logs and pushing them to destination storage. 
Reference: Hadoop and Pig for Large-Scale Web Log Analysis
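A minimal sketch of the Sqoop import itself, assuming Sqoop 1's org.apache.sqoop.Sqoop.runTool entry point; the JDBC URL, credentials, table name and target directory are all hypothetical:

import org.apache.sqoop.Sqoop;

public class ImportUserProfiles {
  public static void main(String[] args) {
    // Equivalent to running: sqoop import --connect ... --table ... --target-dir ...
    String[] sqoopArgs = {
        "import",
        "--connect", "jdbc:mysql://oltp-host/profiles",   // hypothetical OLTP database
        "--username", "etl",
        "--password-file", "/user/etl/.sqoop.pwd",
        "--table", "user_profiles",                       // hypothetical table
        "--target-dir", "/data/user_profiles"             // lands in HDFS for the join
    };
    System.exit(Sqoop.runTool(sqoopArgs));
  }
}

The same import is more commonly run straight from the sqoop command line; it is driven from Java here only to keep the examples in one language.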
Question 6
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?
  1. You will not be able to compress the intermediate data.
  2. You will no longer be able to take advantage of a Combiner.
  3. By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
  4. There are no concerns with this approach. It is always advisable to use multiple reducers.
Correct answer: C
Explanation:
Multiple reducers and total ordering 
If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you've used the -r option to specify the number of reducers on the command line), then by default Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the HashPartitioner means that you can't concatenate your output files to create a single sorted output file. To do this you'll need total ordering, which Hadoop's TotalOrderPartitioner provides. 
Reference: Sorting text files with MapReduce
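For reference, the sketch below reproduces the partitioning logic the default HashPartitioner uses (the class name here is hypothetical; the real new-API class is org.apache.hadoop.mapreduce.lib.partition.HashPartitioner). A given key always reaches the same reducer, but keys are spread by hash value rather than by sort order, which is why the per-reducer output files are not globally ordered; TotalOrderPartitioner is the usual remedy when a single globally sorted result is needed.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same logic as the default HashPartitioner: each reducer's output is sorted
// internally, but the output files are not sorted relative to one another.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}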
Question 7
Given a directory of files with the following structure: line number, tab character, string:
Example:
  1. abialkjfjkaoasdfjksdlkjhqweroij 
  2. kadfjhuwqounahagtnbvaswslmnbfgy 
  3. kjfteiomndscxeqalkzhtopedkfsikj 
You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line: conf.setInputFormat(____.class); ?
  1. SequenceFileAsTextInputFormat
  2. SequenceFileInputFormat
  3. KeyValueTextInputFormat
  4. BDBInputFormat
Correct answer: C
Explanation:
KeyValueTextInputFormat treats each line of the input as one record and splits it at the first tab character, so the line number becomes the key and the remaining string the value. 
Note:
The output format for your first MR job should be SequenceFileOutputFormat - this will store the key/value output from the reducer in a binary format that can then be read back in your second MR job using SequenceFileInputFormat. 
Reference: How to parse CustomWritable from text in Hadoop
http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-in-hadoop (see answer 1 and then see the comment #1 for it)
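A hedged old-API driver sketch matching the question's conf.setInputFormat(...) fragment; the class name and input/output paths are placeholders, and the mapper and reducer are left as the framework defaults:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class LineRecordsDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LineRecordsDriver.class);
    conf.setJobName("line-records");
    // Each line becomes one record; the text before the first tab (the line
    // number) is the key, the rest of the line is the value.
    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}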
Question 8
You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReduce job at runtime?
  1. Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job.
  2. Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
  3. When submitting the job on the command line, specify the -libjars option followed by the JAR file path.
  4. Package your code and the Apache Commons Math library into a zip file named JobJar.zip.
Correct answer: C
Explanation:
The usage of the jar command is:
Usage: hadoop jar <jar> [mainClass] args... 
If you want commons-math3.jar to be available to all the tasks, you can do either of the following:
1. Copy the JAR file into the $HADOOP_HOME/lib directory on every node, or 
2. Use the generic option -libjars when submitting the job (see the driver sketch below).
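A hedged driver sketch showing why -libjars works (the driver class and JAR names are hypothetical): the option is consumed by the GenericOptionsParser, which only runs when the job is launched through ToolRunner:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Invoked, for example, as:
//   hadoop jar myjob.jar AnalysisDriver -libjars commons-math3.jar <in> <out>
public class AnalysisDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();   // already reflects the -libjars setting
    // ... build and submit the MapReduce job here ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new AnalysisDriver(), args));
  }
}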
Question 9
The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or more machines are performing poorly and starts additional copies of a map or reduce task. All the copies run simultaneously, and the results of whichever copy finishes first are used. This is called:
  1. Combine
  2. IdentityMapper
  3. IdentityReducer
  4. Default Partitioner
  5. Speculative Execution
Correct answer: E
Explanation:
Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first. 
Reference: Apache Hadoop, Module 4: MapReduce
Note:
* Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed. 
Failed tasks are tasks that error out. 
* There are a few reasons Hadoop can kill tasks on its own: 
a) Task does not report progress during timeout (default is 10 minutes)  
b) FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler).  
c) Speculative execution means a task's results are no longer needed because another copy of it has already completed elsewhere. 
Reference: Difference failed tasks vs killed tasks
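Speculative execution is on by default and can be toggled per job; a minimal sketch, assuming the MRv2 property names (the older MRv1 names were mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", true);      // keep speculation for maps
    conf.setBoolean("mapreduce.reduce.speculative", false);  // e.g. disable it for reducers
    Job job = Job.getInstance(conf, "speculation-demo");
    // ... configure mapper, reducer and paths, then submit as usual ...
  }
}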
Question 10
For each intermediate key, each reducer task can emit:
  1. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
  2. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
  3. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
  4. One final key-value pair per value associated with the key; no restrictions on the type.
  5. One final key-value pair per key; no restrictions on the type.
Correct answer: E
Explanation:
Reducer reduces a set of intermediate values which share a key to a smaller set of values. 
Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value. 
Reference: Hadoop Map-Reduce Tutorial; Yahoo! Hadoop Tutorial, Module 4: MapReduce
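A hedged sketch in this spirit (the class name is hypothetical): the reducer folds the iterator of intermediate values for each key into a single final pair, and the final value type differs from the intermediate one:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Aggregates the values for one key into a single output pair; the output
// value type (DoubleWritable) differs from the intermediate type (IntWritable).
public class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    long count = 0;
    for (IntWritable v : values) {
      sum += v.get();
      count++;
    }
    context.write(key, new DoubleWritable((double) sum / count));
  }
}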