CCD-410: Cloudera Certified Developer for Apache Hadoop (CCDH)

How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?

Keys are presented to a reducer in sorted order; values for a given key are not sorted.
Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order.
Keys are presented to a reducer in random order; values for a given key are not sorted.
Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.

Correct answer: A

Explanation:

Reducer has 3 primary phases:

1. Shuffle

The Reducer copies the sorted output from each Mapper using HTTP across the network.

2. Sort

The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously; that is, outputs are merged while they are being fetched.

SecondarySort

To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.

3. Reduce

In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.

The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).

The output of the Reducer is not re-sorted.
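The ordering guarantee described above can be illustrated with a minimal plain-Java simulation. This is not Hadoop code: the class name `ShuffleSortSketch`, the method `shuffleAndSort`, and the sample pairs are all assumptions made up for illustration. It only shows the observable contract, namely that keys reach the reducer in sorted order while the values under each key keep whatever order they arrived in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of what a reducer observes; not the Hadoop framework.
class ShuffleSortSketch {

    // Groups (key, value) pairs by key. The TreeMap keeps keys sorted;
    // the per-key lists are never sorted, so values stay in arrival order.
    static Map<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output, deliberately unsorted.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("pear", 7), Map.entry("apple", 9),
            Map.entry("pear", 2), Map.entry("apple", 4));
        // Each entry stands for one reduce(key, values) call.
        shuffleAndSort(mapOutput).forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}
```

In a real job the arrival order of values depends on which map outputs are fetched first, so it should be treated as unpredictable rather than as insertion order.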

Reference: org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

Assuming default settings, which best describes the order of data provided to a reducer’s reduce method:

The keys given to a reducer aren’t in a predictable order, but the values associated with those keys always are.
Both the keys and values passed to a reducer always appear in sorted order.
Neither keys nor values are in any predictable order.
The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order.

Correct answer: D

Explanation:

Reducer has 3 primary phases:

1. Shuffle

The Reducer copies the sorted output from each Mapper using HTTP across the network.

2. Sort

The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously; that is, outputs are merged while they are being fetched.

SecondarySort

To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.

3. Reduce

In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.

The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).

The output of the Reducer is not re-sorted.
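The secondary-sort idea mentioned above (extend the key, sort on the whole key, group on part of it) can be sketched in plain Java. This is a simulation under stated assumptions, not the Hadoop API: `CompositeKey`, `reduceInput`, and the sample data are hypothetical names, and the two comparators are modeled with an ordinary `Comparator` plus a grouping map.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of secondary sort; not the Hadoop framework.
class SecondarySortSketch {

    // Hypothetical composite key: the natural key plus the value to sort on.
    static final class CompositeKey {
        final String natural;
        final int secondary;

        CompositeKey(String natural, int secondary) {
            this.natural = natural;
            this.secondary = secondary;
        }
    }

    static Map<String, List<Integer>> reduceInput(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        // Role of the sort comparator: order by the entire composite key.
        sorted.sort(Comparator.comparing((CompositeKey k) -> k.natural)
                              .thenComparingInt(k -> k.secondary));
        // Role of the grouping comparator: only the natural key decides
        // which records end up in the same reduce call.
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (CompositeKey k : sorted) {
            groups.computeIfAbsent(k.natural, n -> new ArrayList<>()).add(k.secondary);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = List.of(
            new CompositeKey("2013", 30), new CompositeKey("2012", 15),
            new CompositeKey("2013", 5), new CompositeKey("2012", 40));
        System.out.println(reduceInput(keys));
    }
}
```

With the sort applied to the full key, each group's values come out ordered by the secondary field, which is the effect the default shuffle does not provide on its own.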

Reference: org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

You want to populate an associative array in order to perform a map-side join. You’ve decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed.

Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array.

combine
map
init
configure

Correct answer: D

Explanation:

See 3) below.

Here is an illustrative example on how to use the DistributedCache:

// Setting up the cache for the application

1. Copy the requisite files to the FileSystem:

$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat

$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip

$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar

$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar

$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz

$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz

2. Set up the application's JobConf:

JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);

3. Use the cached files in the Mapper or Reducer:

public static class MapClass extends MapReduceBase
    implements Mapper<K, V, K, V> {

  private Path[] localArchives;
  private Path[] localFiles;

  // Called once per task, before any call to map()
  public void configure(JobConf job) {
    // Get the cached archives/files
    localArchives = DistributedCache.getLocalCacheArchives(job);
    localFiles = DistributedCache.getLocalCacheFiles(job);
  }

  public void map(K key, V value,
                  OutputCollector<K, V> output, Reporter reporter)
      throws IOException {
    // Use data from the cached archives/files here
    // ...
    output.collect(key, value);
  }
}
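The pattern in the question (fill an associative array once, then consult it for every record) can be sketched without Hadoop on the classpath. The following is a plain-Java stand-in, assuming a tab-separated lookup file; the class `MapSideJoinSketch`, its method signatures, and the sample rows are hypothetical and only mirror the roles of `configure` and `map`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for a map-side join; not the Hadoop Mapper interface.
class MapSideJoinSketch {

    // The associative array populated from the cached lookup file.
    private final Map<String, String> lookup = new HashMap<>();

    // Plays the role of configure(): runs once, before any record is mapped.
    // Assumes each cached line has the form "key<TAB>value".
    void configure(List<String> cachedFileLines) {
        for (String line : cachedFileLines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
    }

    // Plays the role of map(): joins one input record against the table.
    // Returns null when the key has no match (inner-join behavior).
    String map(String key, String value) {
        String joined = lookup.get(key);
        return joined == null ? null : key + "\t" + value + "\t" + joined;
    }

    public static void main(String[] args) {
        MapSideJoinSketch mapper = new MapSideJoinSketch();
        mapper.configure(List.of("p1\tWidget", "p2\tGadget"));
        System.out.println(mapper.map("p2", "3"));
    }
}
```

Loading the table in the one-time hook rather than inside `map` avoids re-reading the file for every input record.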

Reference: org.apache.hadoop.filecache , Class DistributedCache

You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?

Ingest the server web logs into HDFS using Flume.
Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
Import all users’ clicks from your OLTP databases into Hadoop, using Sqoop.
Channel these clickstreams into Hadoop using Hadoop Streaming.
Sample the weblogs from the web servers, copying them into Hadoop using curl.

Correct answer: A

Explanation:

Hadoop MapReduce for Parsing Weblogs

Here are the steps for parsing a log file using Hadoop MapReduce:

Load log files into the HDFS location using this Hadoop command:

hadoop fs -put <local file path of weblogs> <hadoop HDFS location>

The Opencsv2.3.jar framework is used for parsing log records.

Below is the Mapper program for parsing the log file from the HDFS location.

public static class ParseMapper
    extends Mapper<Object, Text, NullWritable, Text> {

  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the log line on spaces, treating double-quoted text as one field
    CSVParser parse = new CSVParser(' ', '\"');
    String[] sp = parse.parseLine(value.toString());
    int spSize = sp.length;
    // Re-join the fields with commas
    StringBuilder rec = new StringBuilder();
    for (int i = 0; i < spSize; i++) {
      rec.append(sp[i]);
      if (i != (spSize - 1))
        rec.append(",");
    }
    word.set(rec.toString());
    context.write(NullWritable.get(), word);
  }
}

The command below runs the Hadoop-based log parse. The MapReduce program is attached in this article. You can add extra parsing methods in the class. Be sure to create a new JAR with any change and move it to the Hadoop distributed job tracker system.

hadoop jar <path of logparse jar> <hadoop HDFS logfile path> <output path of parsed log file>

The output file is stored in the HDFS location, and the output file name starts with "part-".
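To make the per-line transformation concrete, here is a minimal stdlib-only sketch of what the mapper above does to one record: split on spaces, keep double-quoted text together, and re-join with commas. The class `LogLineSketch`, the method `toCsv`, and the sample log line are assumptions for illustration; the hand-written tokenizer is a simplified stand-in for opencsv's `CSVParser` and may differ from it on edge cases such as escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the space-delimited, quote-aware parsing step.
class LogLineSketch {

    static String toCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;            // toggle quoted mode, drop the quote
            } else if (c == ' ' && !inQuotes) {
                fields.add(current.toString());  // a space outside quotes ends a field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());          // last field has no trailing space
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        // Hypothetical access-log line.
        System.out.println(toCsv("10.0.0.1 - - \"GET /cart HTTP/1.1\" 200"));
    }
}
```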

MapReduce v2 (MRv2/YARN) is designed to address which two issues?

Single point of failure in the NameNode.
Resource pressure on the JobTracker.
HDFS latency.
Ability to run frameworks other than MapReduce, such as MPI.
Reduce complexity of the MapReduce APIs.
Standardize on a single MapReduce API.

Correct answer: BD

Explanation:

YARN (Yet Another Resource Negotiator), as an aspect of Hadoop, has two major kinds of benefits:

* (D) The ability to use programming frameworks other than MapReduce.

/ MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce alternative.

* Scalability, no matter what programming framework you use.

Note:

* The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

* (B) The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, mainly in the JobTracker:

/ Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.

/ Managing the parallelized execution of any specific job. Under YARN, this will be done separately for each job.

The current Hadoop MapReduce system is fairly scalable: Yahoo runs 5000 Hadoop jobs, truly concurrently, on a single cluster, for a total of 1.5 to 2 million jobs per cluster per month. Still, YARN will remove scalability bottlenecks.

Reference: Apache Hadoop YARN – Concepts & Applications

You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you’ve decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface.

Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop.

hadoop “mapred.job.name=Example” MyDriver input output
hadoop MyDriver mapred.job.name=Example input output
hadoop MyDriver -D mapred.job.name=Example input output
hadoop setproperty mapred.job.name=Example MyDriver input output
hadoop setproperty (“mapred.job.name=Example”) MyDriver input output

Correct answer: C

Explanation:

Configure the property using the -D key=value notation:

-D mapred.job.name='My Job'

You can list a whole bunch of options by calling the streaming jar with just the -info argument.
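The effect of the -D key=value notation can be sketched as follows: generic options are peeled off the argument list into configuration properties, and whatever remains is handed to the driver. This is a deliberately simplified plain-Java model, not Hadoop's own option parser; the class `GenericOptionsSketch` and its fields are hypothetical, and only the space-separated `-D key=value` form from the text is handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of "-D key=value" handling; not the Hadoop implementation.
class GenericOptionsSketch {

    final Map<String, String> conf = new LinkedHashMap<>(); // properties set via -D
    final List<String> remaining = new ArrayList<>();       // args left for the driver

    static GenericOptionsSketch parse(String[] args) {
        GenericOptionsSketch parsed = new GenericOptionsSketch();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // Consume the next argument as key=value (split on the first '=').
                String[] kv = args[++i].split("=", 2);
                parsed.conf.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                parsed.remaining.add(args[i]);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed =
            parse(new String[] {"-D", "mapred.job.name=Example", "input", "output"});
        System.out.println(parsed.conf + " " + parsed.remaining);
    }
}
```

This separation is why a driver that implements Tool can read the job name from its configuration while its run method still sees only the input and output arguments.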

Reference: Python hadoop streaming : Setting a job name

You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text).

Identify what determines the data types used by the Mapper for a given job.

The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods
The data types specified in HADOOP_MAP_DATATYPES environment variable
The mapper-specification.xml file submitted with the job determine the mapper’s input key and value types.
The InputFormat used by the job determines the mapper’s input key and value types.

Correct answer: D

Explanation:

The input types fed to the mapper are controlled by the InputFormat used. The default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs. The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.

Note: The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass().

By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these.

Reference: Yahoo! Hadoop Tutorial, THE DRIVER METHOD

Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and monitoring application resource usage?

ResourceManager
NodeManager
ApplicationMaster
ApplicationMasterService
TaskTracker
JobTracker

Correct answer: B

Explanation:

The fundamental idea of MRv2 (YARN) is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

Note: Let’s walk through an application execution sequence:

1. A client program submits the application, including the necessary specifications to launch the application-specific ApplicationMaster itself.

2. The ResourceManager assumes the responsibility to negotiate a specified container in which to start the ApplicationMaster and then launches the ApplicationMaster.

3. The ApplicationMaster, on boot-up, registers with the ResourceManager – the registration allows the client program to query the ResourceManager for details, which allow it to directly communicate with its own ApplicationMaster.

4. During normal operation the ApplicationMaster negotiates appropriate resource containers via the resource-request protocol.

5. On successful container allocations, the ApplicationMaster launches the container by providing the container launch specification to the NodeManager. The launch specification, typically, includes the necessary information to allow the container to communicate with the ApplicationMaster itself.

6. The application code executing within the container then provides necessary information (progress, status etc.) to its ApplicationMaster via an application-specific protocol.

7. During the application execution, the client that submitted the program communicates directly with the ApplicationMaster to get status, progress updates etc. via an application-specific protocol.

8. Once the application is complete, and all necessary work has been finished, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.

Reference: Apache Hadoop YARN – Concepts & Applications

Which best describes how TextInputFormat processes input files and line breaks?

Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
Input file splits may cross line breaks. A line that crosses file splits is ignored.
Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.

Correct answer: A

Explanation:

As the Map operation is parallelized, the input file set is first split into several pieces called FileSplits. If an individual file is so large that it will affect seek time, it will be split into several splits. The splitting does not know anything about the input file's internal logical structure; for example, line-oriented text files are split on arbitrary byte boundaries. Then a new map task is created per FileSplit.

When an individual map task starts, it will open a new output writer per configured reduce task. It will then proceed to read its FileSplit using the RecordReader it gets from the specified InputFormat. InputFormat parses the input and generates key-value pairs. InputFormat must also handle records that may be split on the FileSplit boundary. For example, TextInputFormat will read the last line of the FileSplit past the split boundary and, when reading any FileSplit other than the first, TextInputFormat ignores the content up to the first newline.
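The boundary rule in the last sentence can be checked with a small simulation over an in-memory byte array. This is a sketch under stated assumptions, not Hadoop's record reader: `SplitReaderSketch`, `readSplit`, and the sample data are hypothetical, lines are assumed to end in `\n`, and `end` is taken to be the split's exclusive end offset. A reader that does not start at offset 0 discards everything up to and including the first newline, and every reader keeps going past its end offset to finish the line it has started.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simulation of line reading across split boundaries; not Hadoop code.
class SplitReaderSketch {

    // Returns the lines "owned" by the split covering bytes [start, end).
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: skip the (possibly partial) first line,
            // because the previous split's reader is responsible for it.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // Start a new line only while at or before the split end, but always
        // read that line to completion, even past the boundary.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 8 falls in the middle of "bravo".
        System.out.println(readSplit(data, 0, 8));
        System.out.println(readSplit(data, 8, data.length));
    }
}
```

In this model the line that straddles the boundary is emitted only by the reader of the split in which the line begins, and no line is read twice or dropped.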

Reference: How Map and Reduce operations are actually carried out

For each input key-value pair, mappers can emit:

As many intermediate key-value pairs as designed. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
As many intermediate key-value pairs as designed, but they cannot be of the same type as the input key-value pair.
One intermediate key-value pair, of a different type.
One intermediate key-value pair, but of the same type.
As many intermediate key-value pairs as designed, as long as all the keys have the same types and all the values have the same type.

Correct answer: E

Explanation:

Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
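As a rough illustration of "zero or many output pairs", the sketch below models a map function as a plain Java method that returns a list of pairs. The class `MapperEmitSketch`, the method `map`, and the word-count style logic are assumptions chosen for illustration; the point is that the number of emitted pairs varies per input while every key has one type (String) and every value has one type (Integer).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java model of a map function; not the Hadoop Mapper class.
class MapperEmitSketch {

    // Emits one (word, 1) pair per word in the line: possibly none, possibly many.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                emitted.add(Map.entry(token, 1));
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(map("").size());                   // blank line: nothing emitted
        System.out.println(map("to be or not to be").size()); // one pair per word
    }
}
```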

Reference: Hadoop Map-Reduce Tutorial

Vendor:	Cloudera
Exam Code:	CCD-410
Exam Name:	Cloudera Certified Developer for Apache Hadoop (CCDH)
Date:	Nov 16, 2018