Week 01 Quiz Answers
Graded Quiz: ETL and ELT Processes
1. The ETL process consists of Extract > Transform > Load. Which of these three processes is also known as data wrangling?
- Load
- Extraction
- Data wrangling is a term for another data warehouse process
- Transform
2. The ELT process has no information loss. What is the main reason for this benefit?
- Separates the data pipeline from processing
- Data replication
- Separation between moving and processing data
- Data source integration
3. ETL processes include a storage facility called a staging area. In ELT, the staging area fits the description of what?
- Data mart
- Electronic repository
- Data lake
- Data warehouse
4. Which of the following pain points does ELT address?
- Lack of secure data
- Challenges imposed by Big Data
- Cost effectiveness
- Request for fixed processes
5. There are many techniques for extracting data. The choice of technique usually depends on what?
- Intended use
- Optical or analog
- Operating system
- Type of client
6. Extracting data from IoT devices involves large volumes of redundant data. What is used to reduce the volume of redundant data and extract only the features of interest from the raw data?
- Edge computing
- SQL languages
- APIs
- Biometric sensors
7. ETL uses the schema-on-write approach and ELT uses the schema-on-read approach. What is the biggest difference in these two approaches?
- Limited versatility vs. versatility
- Consistency
- Stability
- More data access
8. Which of the following examples of information loss during transformation can involve false negatives?
- Aggregation
- Filtering
- Lossy data compression
- Edge computing
9. Which of the following loading techniques is between batch and stream loading?
- On-demand loading
- Incremental loading
- Micro-batch loading
- Parallel loading
10. Which of the following loading techniques can split a single file into smaller chunks?
- Parallel loading
- Stream loading
- Batch loading
- Scheduled loading
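Question 10 touches on a practical technique: parallel loading splits a single large file into smaller chunks and loads them concurrently. Below is a minimal Python sketch of that idea; the file name, chunk size, and load_chunk placeholder are hypothetical, not part of the course material.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1000  # rows per chunk (hypothetical tuning value)

def load_chunk(rows):
    """Placeholder load step: a real pipeline would INSERT/COPY into the target."""
    return len(rows)

def parallel_load(path):
    # Split the single source file into smaller chunks of rows.
    with open(path) as f:
        rows = f.readlines()
    chunks = [rows[i:i + CHUNK_SIZE] for i in range(0, len(rows), CHUNK_SIZE)]

    # Load the chunks concurrently instead of one after another.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return sum(pool.map(load_chunk, chunks))

if __name__ == "__main__":
    print(parallel_load("source_data.csv"))  # hypothetical file name
```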
Week 02 Quiz Answers
Graded Quiz 01: ETL using Shell Scripts
1. What is the first stage of the ETL process?
- Cleaning
- Loading
- Transformation
- Extraction
2. Which of these transformations is correctly described?
- Data structuring: fixing any errors or missing values
- Sorting: selecting only what is needed
- Normalizing: converting data to common units
- Cleaning: merging disparate data sources
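To keep the transformation terms in question 2 straight, here is a small self-contained Python sketch of what cleaning, normalizing, filtering, and sorting each typically do; the sample records and unit conversion are invented for illustration only.

```python
# Invented sample records: temperature readings with mixed units and a gap.
records = [
    {"id": 2, "temp": 98.6, "unit": "F"},
    {"id": 1, "temp": 37.0, "unit": "C"},
    {"id": 3, "temp": None, "unit": "C"},
]

# Cleaning: fix errors or missing values (here, drop rows with no reading).
cleaned = [r for r in records if r["temp"] is not None]

# Normalizing: convert data to common units (Fahrenheit -> Celsius).
for r in cleaned:
    if r["unit"] == "F":
        r["temp"] = round((r["temp"] - 32) * 5 / 9, 1)
        r["unit"] = "C"

# Filtering: select only what is needed (readings of 37 C or more).
selected = [r for r in cleaned if r["temp"] >= 37.0]

# Sorting: put the records in a defined order, e.g. by id.
print(sorted(selected, key=lambda r: r["id"]))
```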
3. Which of these is NOT an example of a system in the data load phase?
- A scanned medical document
- An Excel spreadsheet
- A comma separated file
- A data warehouse
4. Select the correct statement regarding ETL workflows as data pipelines.
- Bottlenecks within the pipeline can often be handled by parallelizing slower tasks.
- Data is fed through a data pipeline in large packets.
- Overall accuracy of the ETL workflow has been a more important requirement than speed.
- With conventional ETL pipelines data is processed in real time.
5. Select the correct statement regarding batch processing.
- Batch processing triggers are rarely on demand.
- Data is processed in batches, usually on a weekly schedule.
- Batch processing intervals can be triggered by events.
- When an event of interest occurs, such as an intruder alert, the interval would be periodic.
6. ETL pipelines are frequently used to integrate data from disparate and usually _____ systems within the enterprise.
- siloed
- batched
- aggregating
- simultaneous
7. Select the correct statement regarding Apache Airflow.
- Apache Airflow represents the workflow in DAGs, but not in code.
- Apache Airflow is a workflow orchestration tool.
- Apache Airflow is a well-known commercial tool.
- Apache Airflow tasks can be expressed as Python, but not Bash.
8. Bash uses _____ to turn your file into a Bash shell script.
- loadstat
- getstat
- shebang
- crontab
9. SSIS, Amazon Redshift, IBM InfoSphere Information Server, and Oracle GoldenGate are examples of _____.
- Popular commercial ETL tools
- Popular commercial ELT tools
- Popular open-source ELT tools
- Popular open-source ETL tools
10. ETL jobs can be run on a schedule using _____.
- shebang
- crontab
- loadstat
- getstat
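Questions 8 and 10 meet in practice: a shebang line turns a file into an executable script, and a crontab entry runs it on a schedule. The sketch below is a hypothetical Python ETL script (a Bash script would start with #!/bin/bash instead); the path and schedule in the comment are assumptions, not values from the course.

```python
#!/usr/bin/env python3
# The shebang line above makes this file runnable as a script.
# A Bash ETL script would begin with #!/bin/bash instead.
#
# A crontab entry (added via `crontab -e`) could schedule it, for example
# every day at 02:00 (hypothetical path):
#   0 2 * * * /home/project/etl_job.py >> /home/project/etl_job.log 2>&1

def extract():
    return ["raw record"]               # placeholder extract step

def transform(rows):
    return [r.upper() for r in rows]    # placeholder transform step

def load(rows):
    print(f"loaded {len(rows)} rows")   # placeholder load step

if __name__ == "__main__":
    load(transform(extract()))
```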
Graded Quiz 02: An Introduction to Data Pipelines
1. How does data flow through pipelines?
- Processing threads
- Files
- Software processes
- Data packets
2. Which of the following pipeline monitoring considerations affects the amount of data that passes through the pipeline over time?
- Throughput
- Latency
- Utilization
- Logging and alerting system
3. Which of the following data pipelines corresponds with the fraud detection use case?
- Streaming data pipeline
- Batch data pipeline
- Micro-batch data pipeline
- Lambda architectures
4. Which streaming data pipeline tool allows you to build applications using the Streams Processing Language (SPL)?
- SQLstream
- Apache Samza
- Apache Spark
- IBM Streams
5. Pipelines that incorporate parallelism are referred to as being _____?
- Aligned
- Linear
- Dynamic or non-linear
- Static
6. Batch data pipelines usually run periodically on fixed schedules. Which of the following is another method to run these?
- Triggers
- Error occurrence
- Flags
- Manually
7. Which of the following common features of modern ETL and ELT products is known as “no-code”?
- Security
- Data crawling
- Drag-and-drop
- Fully automated
8. Which of the following data pipeline use cases is the simplest?
- File backup
- Raw data preparation
- Send/receive messages
- Transactional record movement
9. Latency is the total time it takes for a single packet of data to pass through the pipeline. Which of the following limits latency?
- Small data packets
- Bad data
- Data leak
- Slowest process
10. Micro-batch data pipelines decrease the batch size. Which of the following do micro-batch pipelines increase?
- Latency
- Simple transformation
- Storage
- Batch process refresh rate
Week 03 Quiz Answers
Graded Quiz: Using Apache Airflow to build Data Pipelines
1. Apache Airflow pipelines are built on four main principles. Which of the following principles includes parameterization?
- Scalable
- Extensible
- Lean and explicit
- Dynamic
2. Which of the following Apache Airflow use cases involves coordination of data in data warehouses?
- Define machine learning pipeline dependencies
- Decoupled batch processes
- Scheduling tool
- Orchestrate SQL transformation in data warehouses
3. An Apache Airflow DAG is a Python script consisting of logical blocks. Which of the following logical blocks might use the ‘from airflow import DAG’ command?
- Library imports
- DAG definition
- DAG arguments
- Task pipeline
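For reference against questions 3, 7, and 8, here is a minimal sketch of the logical blocks of a DAG script; the DAG name, owner, dates, and tasks are hypothetical, and the import paths assume Airflow 2.x.

```python
# Library imports block
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator  # import path assumes Airflow 2.x

# DAG arguments block: the start date lives here, in default_args
default_args = {
    "owner": "example_owner",            # hypothetical values throughout
    "start_date": datetime(2024, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# DAG definition block
dag = DAG(
    "sample_etl_dag",
    default_args=default_args,
    description="A minimal illustrative DAG",
    schedule_interval=timedelta(days=1),
)

# Task definitions block
extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# Task pipeline block: extract must run before load
extract >> load
```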
4. Sensors are a class of DAG operators. Which is another type of operator that defines DAG tasks?
- Email
- Python
- Bash
- All of the above
5. Which of the following advantages of expressing Apache Airflow workflows as code enables Git to track them?
- Versionable
- Testable
- Maintainable
- Collaborative
6. The ‘Task Instance Context Menu’ can be accessed from any of the DAG views that display what?
- Tree view
- Details
- Task instances
- Gantt
7. The final block in your Airflow pipeline script is where you specify the dependencies for your workflow. How do you specify the order of task 1 and task 2?
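The ordering is declared in the task pipeline block at the end of the script. A minimal sketch, assuming two hypothetical tasks named task1 and task2 and an Airflow 2.3+ import path:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator  # EmptyOperator is available from Airflow 2.3

with DAG("ordering_demo", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    task1 = EmptyOperator(task_id="task1")
    task2 = EmptyOperator(task_id="task2")

    # task1 must finish before task2 starts
    task1 >> task2
    # equivalent spelling: task1.set_downstream(task2)
```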
8. Which block specifies the DAG start date?
- DAG definition
- DAG arguments
- Task definitions
- Task pipeline
9. Which of the following Airflow metrics could fluctuate?
- Timers
- Gauges
- Counters
- None, they all can increase
10. Which of the following Apache Airflow basic components serves the interactive UI?
- DAG directory
- Executor
- Scheduler
- Web Server
Week 04 Quiz Answers
Graded Quiz: Using Apache Kafka to build Pipelines for Streaming Data
1. Event streaming platforms (ESPs) are a middle layer between multiple event sources and destinations. ESPs may have different architectures and components, but they also share some common components. Which of the following common components receives and consumes events?
- Analytic engine
- Query engine
- Event storage
- Event broker
2. The core component of any ESP is the event broker. Which event broker sub-component performs encryption on data?
- Storage
- Processor
- Consumption
- Ingester
3. The Kafka server side is a cluster with many associated servers. What are the associated servers called?
- Associates
- Sub-servers
- Brokers
- Controllers
4. Which of the following main features of Kafka provides consumption without a deadline?
- Distribution system
- Reliability
- Open source
- Permanent persistency
5. Which of the following Kafka core components publishes events into topics?
- Partitions
- Producers
- Consumers
- Brokers
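A minimal sketch of a producer publishing events into a topic, assuming the kafka-python client package and a broker reachable at localhost:9092; the topic name and sample values are hypothetical.

```python
from kafka import KafkaProducer  # assumes the kafka-python package is installed

# Connect to a broker in the Kafka cluster (hypothetical address).
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Publish a few events into the "sensor-readings" topic (hypothetical name).
for reading in [b"21.5", b"21.7", b"22.0"]:
    producer.send("sensor-readings", value=reading)

producer.flush()   # make sure buffered events actually reach the broker
producer.close()
```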
6. Which of the Kafka CLI script files manages topics?
- kafka-console-producer
- kafka-console-consumer
- kafka-console
- kafka-topics
7. Which of the following is Kafka Streams API based on?
- Java
- Gantt chart
- Transformational graph
- Computational graph
8. Which of the following do stream processors do?
- Extract, transform, and load
- Extract, load, and transform
- Receive, transform, and forward
- Process and forward
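To make "receives, transforms, and forwards" concrete, here is a minimal stream-processor sketch written with the kafka-python client rather than the Java Kafka Streams API the quiz refers to; the topic names and broker address are hypothetical.

```python
from kafka import KafkaConsumer, KafkaProducer  # assumes the kafka-python package

consumer = KafkaConsumer("raw-readings", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Receive events from the source topic, transform them, and forward the
# results to a downstream topic.
for event in consumer:
    celsius = float(event.value.decode())                           # receive
    fahrenheit = celsius * 9 / 5 + 32                               # transform
    producer.send("converted-readings", str(fahrenheit).encode())   # forward
```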
9. The Kafka Streams API is based on a computational graph called a stream processing topology. In this topology, each node is a stream processor and the edges are I/O streams. The topology includes two special types of processors. What are they called?
- Aggregation and stream processor
- Source and sink processor
- Stream and topic processor
- Mapping and transformation processor
10. Once events are published and properly stored in topic partitions, you can create _________ to read them.
- Partitions
- Consumers
- Producers
- Brokers
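And a minimal sketch of a consumer reading the stored events back out of a topic's partitions, again assuming the kafka-python client; the topic name, group id, and broker address are hypothetical.

```python
from kafka import KafkaConsumer  # assumes the kafka-python package

# Subscribe to the topic and read events from its partitions.
consumer = KafkaConsumer(
    "sensor-readings",                  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="demo-readers",
    auto_offset_reset="earliest",       # start from the oldest stored events
)

for event in consumer:
    print(event.partition, event.offset, event.value)
```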