Download Data Engineering on Microsoft Azure (beta).DP-203.Dump4Pass.2022-04-07.199q.vcex

Vendor: Microsoft
Exam Code: DP-203
Exam Name: Data Engineering on Microsoft Azure (beta)
Date: Apr 07, 2022
File Size: 13 MB

Demo Questions

Question 1
You are creating dimensions for a data warehouse in an Azure Synapse Analytics dedicated SQL pool.  
You create a table by using the Transact-SQL statement shown in the following exhibit.  
        
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.  
NOTE: Each correct selection is worth one point. 
Correct answer: To work with this question, an Exam Simulator is required.
Explanation:
Box 1: Type 2 
A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.  
  
Incorrect Answers: 
A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten.  
  
Box 2: a business key 
A business key, or natural key, identifies the uniqueness of a row based on columns that exist naturally in a table according to business rules. Examples of business keys are the customer code in a customer table, or the composite of the sales order header number and the sales order item line number in a sales order details table.
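
As a rough illustration of the columns described above, a Type 2 dimension in a dedicated SQL pool might look like the following sketch. The exhibit is not reproduced here, so the table name, column names, data types, and distribution choice are all assumptions made for illustration only.

```sql
-- Hypothetical Type 2 dimension; every name and type below is an assumption.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey   INT IDENTITY(1,1) NOT NULL,  -- surrogate key: one value per version of a member
    CustomerCode  NVARCHAR(20)      NOT NULL,  -- business (natural) key from the source system
    CustomerName  NVARCHAR(100)     NOT NULL,
    StartDate     DATE              NOT NULL,  -- first day on which this version is valid
    EndDate       DATE              NULL,      -- NULL while the version is still current
    IsCurrent     BIT               NOT NULL   -- flag for filtering current members
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (CustomerKey)
);

-- Return only the current version of each dimension member.
SELECT CustomerKey, CustomerCode, CustomerName
FROM dbo.DimCustomer
WHERE IsCurrent = 1;
```

The business key repeats across versions of the same customer, while the surrogate key is unique for each version.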
  
Reference: 
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types  
Question 2
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.  
  
        
  
FactPurchase will have 1 million rows of data added daily and will contain three years of data.  
Transact-SQL queries similar to the following query will be executed daily.  
SELECT  
SupplierKey, StockItemKey, COUNT(*)  
FROM FactPurchase  
WHERE DateKey >= 20210101  
AND DateKey <= 20210131  
GROUP BY SupplierKey, StockItemKey  
  
Which table distribution will minimize query times?
  A. replicated
  B. hash-distributed on PurchaseKey
  C. round-robin
  D. hash-distributed on DateKey
Correct answer: B
Explanation:
Hash-distributed tables improve query performance on large fact tables.  
Round-robin tables are useful for improving loading speed.  
  
Incorrect: 
Not D: Do not use a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions does all the processing work.  
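
A minimal sketch of the recommended distribution follows. The exhibit listing the FactPurchase columns is not reproduced here, so only the columns named in the question are shown, and their data types are assumptions.

```sql
-- Sketch only: column types are assumed, and the remaining exhibit columns are omitted.
CREATE TABLE dbo.FactPurchase
(
    PurchaseKey   BIGINT NOT NULL,
    DateKey       INT    NOT NULL,
    SupplierKey   INT    NOT NULL,
    StockItemKey  INT    NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),   -- spreads rows evenly across the 60 distributions
    CLUSTERED COLUMNSTORE INDEX
);

-- After loading, inspect rows and space per distribution to check for skew.
DBCC PDW_SHOWSPACEUSED('dbo.FactPurchase');
```

With a high-cardinality key such as PurchaseKey, a query that filters on one month of DateKey values can still use all distributions in parallel.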
  
Reference: 
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute  
Question 3
You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement.  
  
        
  
You need to alter the table to meet the following requirements:   
  • Ensure that users can identify the current manager of employees.  
  • Support creating an employee reporting hierarchy for your entire company.  
  • Provide fast lookup of the managers’ attributes such as name and job title.  
Which column should you add to the table?
  A. [ManagerEmployeeID] [int] NULL 
  B. [ManagerEmployeeID] [smallint] NULL
  C. [ManagerEmployeeKey] [int] NULL 
  D. [ManagerName] [varchar](200) NULL
Correct answer: C
Explanation:
Use the same definition as the EmployeeID column.  
  
Reference: 
https://docs.microsoft.com/en-us/analysis-services/tabular-models/hierarchies-ssas-tabular 
Question 4
You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.    
You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.    
CREATE TABLE mytestdb.myParquetTable(  
EmployeeID int,  
EmployeeName string,  
EmployeeStartDate date)  
USING Parquet  
  
You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.  
        
One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.  
SELECT EmployeeID  
FROM mytestdb.dbo.myParquetTable  
WHERE name = 'Alice';  
What will be returned by the query?
  A. 24
  B. an error
  C. a null value
Correct answer: B
Explanation:
Once a database has been created by a Spark job, you can create tables in it with Spark that use Parquet as the storage format. Table names will be converted to lower case and need to be queried using the lower case name. These tables will immediately become available for querying by any of the Azure Synapse workspace Spark pools. They can also be used from any of the Spark jobs subject to permissions.  
Note: For external tables, since they are synchronized to serverless SQL pool asynchronously, there will be a delay until they appear.  
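
One likely contributor to the error is that the query filters on `name`, which is not among the columns defined in the CREATE TABLE statement. As a sketch, a query that uses the defined EmployeeName column and the lower-case table name described above would look like this:

```sql
-- Uses only columns defined in the CREATE TABLE statement from the question.
-- The table name is written in lower case, as the explanation above describes.
SELECT EmployeeID
FROM mytestdb.dbo.myparquettable
WHERE EmployeeName = 'Alice';
```

Whether the row is visible also depends on the synchronization delay mentioned in the note above.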
Reference: 
https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table 
Question 5
You have a table named SalesFact in an enterprise data warehouse in Azure Synapse Analytics. SalesFact contains sales data from the past 36 months and has the following characteristics: 
  • Is partitioned by month  
  • Contains one billion rows  
  • Has clustered columnstore indexes  
At the beginning of each month, you need to remove data from SalesFact that is older than 36 months as quickly as possible.  
Which three actions should you perform in sequence in a stored procedure? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.   
Correct answer: To work with this question, an Exam Simulator is required.
Explanation:
Step 1: Create an empty table named SalesFact_Work that has the same schema as SalesFact. 
Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work. 
SQL Data Warehouse supports partition splitting, merging, and switching. To switch partitions between two tables, you must ensure that the partitions align on their respective boundaries and that the table definitions match.  
  
Loading data into partitions with partition switching is a convenient way to stage new data in a table that is not visible to users, and then switch in the new data.  
Step 3: Drop the SalesFact_Work table. 
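
The three steps can be sketched in T-SQL as follows. The question does not give the distribution column, partition column, or boundary values of SalesFact, so `ProductKey`, `OrderDateKey`, the boundary list, and the partition number are hypothetical placeholders. In practice they must mirror the actual definition of SalesFact.

```sql
-- Step 1: create an empty work table with the same schema, distribution, and partition scheme.
-- ProductKey, OrderDateKey, and the boundary values are assumptions for illustration.
CREATE TABLE dbo.SalesFact_Work
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20190101, 20190201))
)
AS
SELECT * FROM dbo.SalesFact WHERE 1 = 2;   -- copies the schema but no rows

-- Step 2: switch the stale partition out of SalesFact (a metadata-only operation).
-- Partition 2 is assumed to hold the month being removed.
ALTER TABLE dbo.SalesFact SWITCH PARTITION 2 TO dbo.SalesFact_Work PARTITION 2;

-- Step 3: drop the work table, which discards the stale rows.
DROP TABLE dbo.SalesFact_Work;
```

Switching avoids a row-by-row DELETE against a billion-row table, which is why it is the fastest of the listed approaches.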
Reference: 
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition 
Question 6
You have files and folders in Azure Data Lake Storage Gen2 for an Azure Synapse workspace as shown in the following exhibit.  
  
        
   
You create an external table named ExtTable that has LOCATION='/topfolder/'.    
When you query ExtTable by using an Azure Synapse Analytics serverless SQL pool, which files are returned?
  A. File2.csv and File3.csv only 
  B. File1.csv and File4.csv only
  C. File1.csv, File2.csv, File3.csv, and File4.csv
  D. File1.csv only
Correct answer: B
Explanation:
To run a T-SQL query over a set of files within a folder or set of folders while treating them as a single entity or rowset, provide a path to a folder or a pattern (using wildcards) over a set of files or folders.  
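
As a sketch of the folder and wildcard paths described above, the following serverless SQL pool queries read CSV files under a folder. The storage account and container in the URL are placeholders, not values from the question.

```sql
-- Files directly inside topfolder that match the wildcard.
-- <storage-account> and <container> are placeholders.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/topfolder/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS [result];

-- Files in topfolder and all of its subfolders (recursive traversal with /**).
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/topfolder/**',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS [result];
```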
  
Reference: 
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-data-storage#query-multiple-files-or-folders  
Question 7
You are planning the deployment of Azure Data Lake Storage Gen2.  
You have the following two reports that will access the data lake: 
  • Report1: Reads three columns from a file that contains 50 columns. 
  • Report2: Queries a single record based on a timestamp.   
You need to recommend in which format to store the data in the data lake to support the reports. The solution must minimize read times.    
What should you recommend for each report? To answer, select the appropriate options in the answer area.    
NOTE: Each correct selection is worth one point. 
Correct answer: To work with this question, an Exam Simulator is required.
Explanation:
Report1: Parquet 
Parquet is a columnar format, so a query that reads three columns from a file that contains 50 columns can skip the remaining columns instead of scanning every full row. 
  
Report2: AVRO 
AVRO is a row-based format that supports timestamps, which suits retrieving a single complete record.  
Not CSV, TSV: Delimited text files store no column metadata, so every row must be read in full. 
Reference: 
https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Destinations/ADLS-G2-D.html
Question 8
You are designing the folder structure for an Azure Data Lake Storage Gen2 container.    
Users will query data by using a variety of services including Azure Databricks and Azure Synapse Analytics serverless SQL pools. The data will be secured by subject area. Most queries will include data from the current year or current month.    
Which folder structure should you recommend to support fast queries and simplified folder security? 
  A. /{SubjectArea}/{DataSource}/{DD}/{MM}/{YYYY}/{FileData}_{YYYY}_{MM}_{DD}.csv
  B. /{DD}/{MM}/{YYYY}/{SubjectArea}/{DataSource}/{FileData}_{YYYY}_{MM}_{DD}.csv 
  C. /{YYYY}/{MM}/{DD}/{SubjectArea}/{DataSource}/{FileData}_{YYYY}_{MM}_{DD}.csv
  D. /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}_{YYYY}_{MM}_{DD}.csv
Correct answer: D
Explanation:
There's an important reason to put the date at the end of the directory structure. If you want to lock down certain regions or subject matters to users/groups, then you can easily do so with the POSIX permissions.  
Otherwise, if there was a need to restrict a certain security group to viewing just the UK data or certain planes, then with the date structure in front, a separate permission would be required for numerous directories under every hour directory. Additionally, having the date structure in front would exponentially increase the number of directories as time went on.  
  
Note: In IoT workloads, there can be a great deal of data being landed in the data store that spans across numerous products, devices, organizations, and customers. It’s important to pre-plan the directory layout for organization, security, and efficient processing of the data for downstream consumers. A general template to consider might be the following layout:   
{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/  
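
To illustrate how a layout with the date at the end supports fast queries, the following serverless SQL pool sketch uses wildcards for the year, month, and day folders and filters on them with `filepath()`. The subject area `Sales`, the data source `ERP`, and the storage account and container are hypothetical placeholders.

```sql
-- Count rows for April 2022 only; files in other year and month folders are skipped.
-- Sales, ERP, <storage-account>, and <container> are assumptions for illustration.
SELECT COUNT(*) AS row_count
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/Sales/ERP/*/*/*/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS [r]
WHERE r.filepath(1) = '2022'   -- first wildcard: {YYYY}
  AND r.filepath(2) = '04';    -- second wildcard: {MM}
```

Because the subject area sits at the top of the path, permissions can be granted once on `/Sales/` rather than on every date folder.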
Question 9
You need to output files from Azure Data Factory.    
Which file format should you use for each type of output? To answer, select the appropriate options in the answer area.    
NOTE: Each correct selection is worth one point. 
Correct answer: To work with this question, an Exam Simulator is required.
Explanation:
Box 1: Parquet 
Parquet stores data in columns, while Avro stores data in a row-based format. By their very nature, column-oriented data stores are optimized for read-heavy analytical workloads, while row-based databases are best for write-heavy transactional workloads.  
  
Box 2: Avro 
An Avro schema is created using JSON format.  
AVRO supports timestamps.  
  
Note: Azure Data Factory supports the following file formats (not GZip or TXT). 
Avro format  
Binary format  
Delimited text format  
Excel format  
JSON format  
ORC format  
Parquet format  
XML format  
  
Reference: 
https://www.datanami.com/2018/05/16/big-data-file-formats-demystified 
Question 10
You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.    
Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and data from a subsidiary of your company.    
You need to move the files to a different folder and transform the data to meet the following requirements:   
  • Provide the fastest possible query times.  
  • Automatically infer the schema from the underlying files.    
How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer area.    
NOTE: Each correct selection is worth one point. 
Correct answer: To work with this question, an Exam Simulator is required.
Explanation:
Box 1: Preserve hierarchy 
Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.  
  
Box 2: Parquet 
Azure Data Factory parquet format is supported for Azure Data Lake Storage Gen2.  
Parquet supports the schema property.  
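
As a sketch of the schema inference requirement, a serverless SQL pool can query Parquet output without a column list, because column names and types are read from the Parquet file metadata. The storage account, container, and `curated` folder below are hypothetical placeholders.

```sql
-- No WITH (...) column list is supplied; the schema is inferred from the Parquet files.
-- <storage-account>, <container>, and the curated folder are placeholders.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/curated/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];
```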
  
Reference: 
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction   
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet