Explore the fundamentals of designing data lakes on AWS with the AWS Cloud Solutions Architect Professional Certificate. This course covers how to use AWS services to build scalable, efficient data lakes, including data ingestion, storage, processing, and security. It is aimed at professionals who want to deepen their cloud architecture expertise and leave with practical skills for designing robust data lake solutions that meet modern data management challenges.
Notice!
Always refer to the modules in your course for the most accurate and up-to-date information.
Attention!
If you have any questions that are not covered in this post, please feel free to leave them in the comments section below. Thank you for your engagement.
Week 1 Quiz
- The ability to store user-generated data, such as data from antennas and sensors.
- The ability to define the data schema before ingesting and storing data.
- The ability to combine multiple databases together to expand their capacity and availability.
- The ability to ingest and store data that could be the answer for future questions when they are processed with the correct data processing mechanisms.
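One of the options above describes the schema-on-read value of a data lake: ingest raw data now and impose a schema only when a question is asked. A minimal local sketch of that idea (record contents and field names are hypothetical):

```python
import json

# Hypothetical raw events landed in the lake as-is, with no upfront schema.
raw_records = [
    '{"sensor_id": "a1", "temp_c": 21.5, "ts": "2024-01-01T00:00:00Z"}',
    '{"sensor_id": "b2", "temp_c": 19.0, "ts": "2024-01-01T00:05:00Z", "battery": 0.87}',
]

def read_with_schema(records, fields):
    """Apply a schema at read time: project only the fields a question needs."""
    return [{f: json.loads(r).get(f) for f in fields} for r in records]

# A question asked later decides the schema; fields a record lacks come back None.
rows = read_with_schema(raw_records, ["sensor_id", "temp_c"])
print(rows[0])  # {'sensor_id': 'a1', 'temp_c': 21.5}
```

Because the schema lives in the reader, the same stored data can answer future questions that were unknown at ingestion time.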
- True
- False
- Data lakes mostly process data after it has been stored in the cloud or on-premises.
- A data lake provides the most secure way to store data in the AWS Cloud.
- With a data lake, a company can store structured and unstructured data at virtually any scale.
- A data lake is a direct replacement of a data warehouse.
- Data lakes use schema-on-write architectures and data warehouses use schema-on-read architectures.
- Data lakes offer more choices in terms of the technology that is used for processing data. In contrast, data warehouses are more restricted to using Structured Query Language (SQL) as the query technology.
- The solutions architect can combine both data lakes and data warehouses to better extract insights and turn data into information.
- The solutions architect cannot attach data visualization tools to data warehouses.
- Data lakes are not future-proof, which means that they must be reconfigured each time new data is ingested.
- True
- False
- Data swamp
- Data warehouse
- Data catalog
- Database
- Amazon Athena
- Amazon Kinesis Data Firehose
- Amazon Simple Storage Service (Amazon S3)
- Amazon Kinesis Agent
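Several options above involve delivering streaming data into Amazon S3 with Amazon Kinesis Data Firehose. As a hedged sketch, this is roughly the shape of the parameters you could pass to boto3's `firehose.create_delivery_stream`; the stream name, role, and bucket ARNs are placeholders, not values from the course:

```python
# Hypothetical configuration for a Firehose delivery stream that buffers
# incoming records and writes them, gzip-compressed, into an S3 data lake.
delivery_stream_params = {
    "DeliveryStreamName": "clickstream-to-datalake",  # placeholder name
    "DeliveryStreamType": "DirectPut",
    "ExtendedS3DestinationConfiguration": {
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::example-datalake-raw",                    # placeholder
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
        "Prefix": "raw/clickstream/",
    },
}
# In a real account you would call:
# boto3.client("firehose").create_delivery_stream(**delivery_stream_params)
print(delivery_stream_params["DeliveryStreamName"])
```

The buffering hints are the key design choice here: Firehose batches records by size or time before writing, which produces fewer, larger S3 objects.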
Week 2 Quiz
- True
- False
- The AWS Glue Metadata Catalog contains buckets with different types of storage options. AWS Glue Metadata Catalog stores data as objects in these buckets.
- The AWS Glue Metadata Catalog is the storage that is associated with automated database backups and any active database snapshots. It consists of the General Purpose SSD, Provisioned IOPS SSD, Throughput Optimized HDD, and Cold HDD volume types.
- The AWS Glue Metadata Catalog consists of file systems or databases for any applications that require fine, granular updates and access to raw, unformatted, block-level storage.
- The AWS Glue Metadata Catalog consists of tables. Each table has a schema, which outlines the structure of a table, including columns, data type definitions, and more. The tables are organized into logical groups that are called databases.
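The last option above describes the catalog's structure: tables carry schemas and are grouped into databases. A small local model of that hierarchy (database, table, and column names are hypothetical):

```python
# Toy model of the Glue catalog hierarchy: databases group tables,
# and each table carries a schema (columns plus data types) and a location.
catalog = {
    "databases": {
        "sales_db": {
            "tables": {
                "orders": {
                    "columns": [
                        {"Name": "order_id", "Type": "string"},
                        {"Name": "amount", "Type": "double"},
                        {"Name": "order_date", "Type": "date"},
                    ],
                    "location": "s3://example-datalake/curated/orders/",
                }
            }
        }
    }
}

schema = catalog["databases"]["sales_db"]["tables"]["orders"]["columns"]
print([c["Name"] for c in schema])  # ['order_id', 'amount', 'order_date']
```

Note that the catalog stores only metadata; the data itself stays at the S3 location the table points to.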
- True
- False
- AWS Lake Formation registers the Amazon Simple Storage Service (Amazon S3) buckets and paths where the data lake will reside.
- AWS Lake Formation runs big data frameworks, such as Apache Hadoop.
- AWS Lake Formation ingests, cleanses, and transforms the structured and organized data.
- AWS Lake Formation deploys, operates, and scales clusters in the AWS Cloud.
- Amazon Kinesis Data Analytics
- Amazon EMR
- Amazon Athena
- AWS Glue Jobs
- AWS Lambda
- Amazon Kinesis Data Analytics
- Amazon OpenSearch Service
- Amazon EMR
- Create an AWS Lambda function with the training logic in the handler, and run the training based on an event.
- Use a pretrained model from an AWS service, such as Amazon Rekognition.
- Launch an Amazon Elastic Compute Cloud (Amazon EC2) instance and run Amazon SageMaker on it to train the model.
- Launch an Amazon Elastic Compute Cloud (Amazon EC2) instance by using an AWS Deep Learning Amazon Machine Image (AMI) to host the application that will train the model.
- Amazon API Gateway
- Amazon EMR
- Amazon Kinesis
- AWS Lambda
- Amazon Athena
- AWS Lambda
- AWS Glue
- Amazon Redshift
- Amazon Elastic Compute Cloud (Amazon EC2)
Week 3 Quiz
- True
- False
- AWS account root user
- AWS Identity and Access Management (IAM) user
- AWS Identity and Access Management (IAM) role
- Access keys
- Structured data, unstructured data, and semi-structured data
- Ready data, not-ready data, and semi-ready data
- The good data, the bad data, and the ugly data
- Development data, quality assurance (QA) data, and production data
- AWS Snowcone
- AWS Snowmobile
- AWS Snowball
- AWS Glue
- An AWS Glue crawler collects and catalogs data from databases and object storage, moves the data into a new Amazon Simple Storage Service (Amazon S3) data lake, and classifies the data by using machine learning algorithms.
- An AWS Glue crawler can scan a data store, such as an Amazon Simple Storage Service (Amazon S3) bucket, and use the data from the data store to create or update tables in the AWS Glue Data Catalog.
- An AWS Glue crawler runs Structured Query Language (SQL) queries to analyze data directly in Amazon Simple Storage Service (Amazon S3).
- An AWS Glue crawler performs interactive log analytics, real-time application monitoring, website search, and more.
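One option above describes what a crawler actually does: scan a data store such as an S3 bucket and create or update catalog tables. As a hedged sketch, this is roughly the argument shape for boto3's `glue.create_crawler`; the crawler name, role, database, path, and schedule are placeholders:

```python
# Hypothetical arguments for creating a Glue crawler that scans an S3 prefix
# and populates tables in a catalog database on a nightly schedule.
crawler_params = {
    "Name": "raw-zone-crawler",                                        # placeholder
    "Role": "arn:aws:iam::123456789012:role/glue-crawler-role",        # placeholder
    "DatabaseName": "raw_db",
    "Targets": {"S3Targets": [{"Path": "s3://example-datalake/raw/"}]},
    "Schedule": "cron(0 2 * * ? *)",  # rescan nightly to pick up new data
}
# In a real account you would call:
# boto3.client("glue").create_crawler(**crawler_params)
```

Running the crawler on a schedule keeps the catalog in sync as new objects land in the raw zone.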
- True
- False
- Amazon Kinesis Data Streams stores data only in the JSON format.
- The Amazon Kinesis Family can ingest a high volume of small bits of data that are being processed in real time.
- By writing data consumers, customers can move data that is ingested into Amazon Kinesis Data Streams to an Amazon Simple Storage Service (Amazon S3) bucket with minimum modification.
- Amazon Kinesis Data Analytics loads data streams into AWS databases.
- Amazon Kinesis Data Analytics provides an option to author non-Structured Query Language (SQL) code to process and analyze streaming data.
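Several of the options above concern how Kinesis Data Streams handles many small records in real time. Under the hood, Kinesis routes each record to a shard by taking the MD5 hash of its partition key over a 128-bit range split among the shards. A simplified local model of that routing (shard count is an assumption for illustration):

```python
import hashlib

# Simplified model of Kinesis shard routing: MD5-hash the partition key
# into a 128-bit space divided evenly among the shards.
NUM_SHARDS = 2
HASH_SPACE = 2 ** 128

def shard_for(partition_key: str) -> int:
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return h // (HASH_SPACE // NUM_SHARDS)

# The same key always maps to the same shard, preserving per-key ordering.
print(shard_for("sensor-a1") == shard_for("sensor-a1"))  # True
```

This is why choosing a good partition key matters: it spreads load across shards while keeping records for one key in order.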
- Account monitoring with AWS CloudTrail
- Log monitoring with Amazon CloudWatch
- Log analysis with Amazon Kinesis Family
- Log analysis with Amazon Pinpoint
Week 4 Quiz
- True
- False
- Transform data in real time as data comes into the data lake.
- Analyze data in batches on schedule or on demand.
- Transform data on a schedule or on demand.
- Analyze data in real time as data comes into the data lake.
- Compressed data uses a row-based data format that works well for data optimization.
- By using compressed data, data-processing systems can optimize for memory and cost.
- Compressed data slows the time to process and analyze information.
- Compressed data increases the risk of losing valuable information.
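The options above weigh the benefits of compressed data. The memory and cost optimization is easy to demonstrate locally with gzip on repetitive, record-like data (the records here are made up for illustration):

```python
import gzip
import json

# Repetitive record-like data, as commonly found in lake storage.
records = [{"country": "US", "status": "ok", "value": i} for i in range(1000)]
raw = json.dumps(records).encode()
packed = gzip.compress(raw)

# Repeated field names and values compress well, cutting storage and scan cost.
print(len(packed) < len(raw))  # True
```

Columnar formats such as Parquet push this further by grouping similar values together before compressing, which is why they are favored for analytics workloads.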
- AWS
- Customer
- Both AWS and the customer
- Third-party security company
- Raw data is generally formatted to be read and used by a human eye.
- Visualization data is always captured in a text editor.
- If there is more data, making sense of the data will be more difficult without using visualization tools.
- A click map is the main reason to invest in data visualization.
- The ability to visualize data
- The ability to create sharable dashboards
- Super-fast, Parallel, In-memory Calculation Engine (SPICE)
- Data encryption at every layer
- Help people discover and share datasets that are available outside of AWS resources.
- Help people discover and share datasets that are available through AWS resources.
- Provide a service that people can use to transform public datasets that are published by data providers through an API.
- Provide a service that people can use to ingest software as a service (SaaS) application data into a data lake.
- True
- False
Final Assessment
- The AWS Glue Metadata Catalog provides a repository where a company can store, find, and access metadata, and use that metadata to query and transform the data.
- The AWS Glue Metadata Catalog is a query service that uses standard Structured Query Language (SQL) to retrieve data.
- The AWS Glue Metadata Catalog provides a repository where a company can store and find metadata to keep track of user permissions to data in a data lake.
- The AWS Glue Metadata Catalog provides a data transformation service where a company can author and run scripts to transform data between data sources and targets.
- AWS Glue Metadata Catalog
- Amazon OpenSearch Service
- Amazon EMR
- Amazon Simple Storage Service (Amazon S3)
- Batch data ingestion is the process of capturing gigabytes (GB) of data per second from multiple sources, such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events.
- Batch data ingestion is the process of collecting and transferring large amounts of data that have already been produced and stored on premises or in the cloud.
- By using batch data ingestion, a user can create a unified metadata repository across various services on AWS.
- Batch data ingestion is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning, and application development.
- Amazon EMR
- Amazon Kinesis Data Analytics
- Amazon Athena
- AWS Glue job
- Amazon OpenSearch Service
- Amazon Kinesis Data Analytics
- Amazon EMR
- AWS Lambda
- Create an AWS Lambda function with the training logic in the handler, and run the training based on an event.
- Launch an Amazon Elastic Compute Cloud (Amazon EC2) instance and run Amazon SageMaker on it to train the model.
- Use a pretrained model from an AWS service, such as Amazon Rekognition.
- Launch an Amazon Elastic Compute Cloud (Amazon EC2) instance by using an AWS Deep Learning Amazon Machine Image (AMI) to host the application that will train the model.
- The ability to define the data schema before ingesting and storing data.
- The ability to ingest and store data that could be the answer for future questions when they are processed with the correct data processing mechanisms.
- The ability to store user-generated data, such as data from antennas and sensors.
- The ability to combine multiple databases together to expand their capacity and availability.
- Data lakes use schema-on-write architectures and data warehouses use schema-on-read architectures.
- Data lakes offer more choices in terms of the technology that is used for processing data. In contrast, data warehouses are more restricted to using Structured Query Language (SQL) as the query technology.
- The solutions architect can combine both data lakes and data warehouses to better extract insights and turn data into information.
- The solutions architect cannot attach data visualization tools to data warehouses.
- Data lakes are not future-proof, which means that they must be reconfigured each time new data is ingested.
- Increase operational overhead
- Make data available from integrated departments
- Lower transactional costs
- Limit data movement
- Offload capacity from databases and data warehouses
- Data swamp
- Data warehouse
- Data catalog
- Database
- No, data lakes do not make it easier to follow “the right tool for the job” approach because you are tied to a specific AWS service.
- Yes, data lakes make it easier to follow “the right tool for the job” approach because storage can be decoupled from processing and ingestion.
- No, data lakes do not make it easier to follow “the right tool for the job” approach because data lakes can only handle structured data.
- Yes, data lakes make it easier to follow “the right tool for the job” approach because data lakes can only handle structured data.
- Analyze data in batches on schedule or on demand.
- Analyze data in real time as data comes into the data lake.
- Transform data on a schedule or on demand.
- Transform data in real time as data comes into the data lake.
- Store metadata in a catalog for indexing.
- Populate the AWS Glue Data Catalog with tables.
- Map data from one schema to another schema.
- Analyze all data in the data lake to create an Apache Hive metastore.
- AWS
- Customer
- Both AWS and the customer
- Third-party security company
- Super-fast, Parallel, In-memory Calculation Engine (SPICE)
- The ability to create sharable dashboards
- The ability to visualize data
- Data encryption at every layer
- Help people discover and share datasets that are available through AWS resources.
- Help people discover and share datasets that are available outside of AWS resources.
- Provide a service that people can use to transform public datasets that are published by data providers through an API.
- Provide a service that people can use to ingest software as a service (SaaS) application data into a data lake.
- Amazon Simple Storage Service (Amazon S3) is mostly used for storage, and AWS Glue is mostly used for categorizing data.
- Data lakes need to be schema-on-write. In this case, users need to transform all the data before they load it into the data lake.
- Data lakes are not future-proof, which means that they must be reconfigured each time new data is ingested.
- When cataloging data, it is a best practice to organize the data according to the access pattern of the user who will access it.
- Users must delete the original raw data to keep their data lake organized and cataloged.
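One option above names the cataloging best practice: organize data according to the access pattern of its readers. In practice that often means Hive-style key prefixes (`partition=value`), which query engines can prune at read time. A small sketch (bucket layout and dataset names are hypothetical):

```python
from datetime import date

# Build Hive-style partitioned object keys so that date-filtered queries
# only need to scan the matching prefixes.
def object_key(dataset: str, d: date, filename: str) -> str:
    return f"{dataset}/year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"

key = object_key("curated/orders", date(2024, 5, 7), "part-0000.parquet")
print(key)  # curated/orders/year=2024/month=05/day=07/part-0000.parquet
```

A query restricted to May 2024 can then skip every prefix outside `year=2024/month=05/`, reading far less data.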
- Customer reviews on products in retailer websites
- Data that is sitting in a relational MySQL table
- Video files from mobile phone photo libraries
- Raw data from marketing research surveys
- Ready data, not-ready data, and semi-ready data
- Development data, quality assurance (QA) data, and production data
- Structured data, unstructured data, and semi-structured data
- The good data, the bad data, and the ugly data
- If data is not consumed within 15 minutes, Kinesis will delete the data that was added to the stream. This case is true even though the data-retention window is greater than 15 minutes.
- If data is consumed by a consumer, that consumer can never get that same data again. This case is true even if the data is still in the stream, according to the data-retention window.
- Data consumers must use an AWS SDK to correctly fetch data from Kinesis in the same order that it was ingested. However, AWS Lambda functions do not need to fetch data from Kinesis in a specific order because Lambda integrates natively with AWS services, including Kinesis.
- Data is automatically pushed to each consumer that is connected to Kinesis. Thus, consumers are notified that new data is available, even when they are not running the Kinesis SDK for data consumption.
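The semantics these options probe can be sketched with a toy model: records stay in the stream for the retention window, and each consumer tracks its own iterator, so one consumer reading a record does not remove it for the others:

```python
# Toy model of Kinesis retention semantics: records persist for the
# retention window, and each consumer keeps an independent iterator.
stream = ["rec-1", "rec-2", "rec-3"]  # records still within retention

class Consumer:
    def __init__(self):
        self.position = 0  # each consumer tracks its own position (shard iterator)

    def get_records(self, limit=10):
        batch = stream[self.position:self.position + limit]
        self.position += len(batch)
        return batch

a, b = Consumer(), Consumer()
print(a.get_records())  # ['rec-1', 'rec-2', 'rec-3']
print(b.get_records())  # ['rec-1', 'rec-2', 'rec-3'] (same data, independent iterator)
```

This pull-based model is also why consumers are not automatically notified of new data; they fetch from their iterator position when they poll.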