AWS Certified Big Data - Specialty
Page 3 of 9 (Questions 21-30 of 81)
Question#21

A company uses Amazon Redshift for its enterprise data warehouse. A new on-premises PostgreSQL OLTP
DB must be integrated into the data warehouse. Each table in the PostgreSQL DB has an indexed timestamp column. The data warehouse has a staging layer to load source data into the data warehouse environment for further processing.
The data lag between the source PostgreSQL DB and the Amazon Redshift staging layer should NOT exceed four hours.
What is the most efficient technique to meet these requirements?

  • A. Create a DBLINK on the source DB to connect to Amazon Redshift. Use a PostgreSQL trigger on the source table to capture the new insert/update/delete event and execute the event on the Amazon Redshift staging table.
  • B. Use a PostgreSQL trigger on the source table to capture the new insert/update/delete event and write it to Amazon Kinesis Streams. Use a KCL application to execute the event on the Amazon Redshift staging table.
  • C. Extract the incremental changes periodically using a SQL query. Upload the changes to multiple Amazon Simple Storage Service (S3) objects, and run the COPY command to load to the Amazon Redshift staging layer.
  • D. Extract the incremental changes periodically using a SQL query. Upload the changes to a single Amazon Simple Storage Service (S3) object, and run the COPY command to load to the Amazon Redshift staging layer.
Answer: C
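
A minimal sketch of option C in Python (psycopg2 plus boto3), assuming hypothetical connection details, bucket, table, cluster, and IAM role names; the point is splitting the periodic extract across several S3 objects so a single COPY can load them in parallel within the four-hour window:

    import gzip
    import boto3
    import psycopg2  # assumes network connectivity to the on-premises PostgreSQL DB

    s3 = boto3.client("s3")
    rsd = boto3.client("redshift-data")

    BUCKET = "example-staging-bucket"      # hypothetical staging bucket
    PREFIX = "orders/2016-06-01T00/"       # one prefix per extraction window

    def extract_and_load(last_watermark, num_chunks=4):
        conn = psycopg2.connect(host="onprem-pg.example.com", dbname="oltp",
                                user="etl", password="...")
        chunks = [[] for _ in range(num_chunks)]
        with conn, conn.cursor() as cur:
            # The indexed timestamp column keeps this incremental query cheap.
            cur.execute("SELECT id, status, amount, updated_at FROM orders "
                        "WHERE updated_at > %s", (last_watermark,))
            for i, row in enumerate(cur):
                # pipe-delimited line; real code would escape delimiters properly
                chunks[i % num_chunks].append("|".join(str(c) for c in row))
        for i, lines in enumerate(chunks):
            body = gzip.compress(("\n".join(lines) + "\n").encode())
            s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}part-{i:04d}.gz", Body=body)
        # A single COPY over the prefix loads all objects in parallel across slices.
        rsd.execute_statement(
            ClusterIdentifier="example-dw-cluster", Database="dw", DbUser="etl",
            Sql=(f"COPY staging.orders FROM 's3://{BUCKET}/{PREFIX}' "
                 "IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole' "
                 "GZIP DELIMITER '|';"))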

Question#22

An administrator is deploying Spark on Amazon EMR for two distinct use cases: machine learning algorithms and ad-hoc querying. All data will be stored in Amazon S3. Two separate clusters for each use case will be deployed. The data volumes on Amazon S3 are less than 10 GB.
How should the administrator align instance types with each cluster's purpose?

  • A. Machine Learning on C instance types and ad-hoc queries on R instance types
  • B. Machine Learning on R instance types and ad-hoc queries on G2 instance types
  • C. Machine Learning on T instance types and ad-hoc queries on M instance types
  • D. Machine Learning on D instance types and ad-hoc queries on I instance types
Answer: A
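
A hedged sketch of how the two clusters from option A could be launched with boto3, using compute-optimized instances for the ML cluster and memory-optimized instances for ad-hoc Spark SQL; the release label, IAM roles, and exact instance sizes are placeholders:

    import boto3

    emr = boto3.client("emr")

    def launch_cluster(name, instance_type):
        """Launch a small Spark cluster; roles and counts are placeholders."""
        return emr.run_job_flow(
            Name=name,
            ReleaseLabel="emr-6.15.0",                    # assumed release label
            Applications=[{"Name": "Spark"}],
            Instances={
                "InstanceGroups": [
                    {"Name": "master", "InstanceRole": "MASTER",
                     "InstanceType": instance_type, "InstanceCount": 1},
                    {"Name": "core", "InstanceRole": "CORE",
                     "InstanceType": instance_type, "InstanceCount": 2},
                ],
                "KeepJobFlowAliveWhenNoSteps": True,
            },
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )

    launch_cluster("ml-training", "c5.2xlarge")     # compute-optimized for ML
    launch_cluster("adhoc-queries", "r5.2xlarge")   # memory-optimized for queries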

Question#23

An organization is designing an application architecture. The application will have over 100 TB of data and will support transactions that arrive at rates from hundreds per second to tens of thousands per second, depending on the day of the week and time of day. All transaction data must be durably and reliably stored. Certain read operations must be performed with strong consistency.
Which solution meets these requirements?

  • A. Use Amazon DynamoDB as the data store and use strongly consistent reads when necessary.
  • B. Use an Amazon Relational Database Service (RDS) instance sized to meet the maximum anticipated transaction rate and with the High Availability option enabled.
  • C. Deploy a NoSQL data store on top of an Amazon Elastic MapReduce (EMR) cluster, and select the HDFS High Durability option.
  • D. Use Amazon Redshift with synchronous replication to Amazon Simple Storage Service (S3) and row-level locking for strong consistency.
Answer: A
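
A minimal boto3 sketch of option A, using a hypothetical transactions table; the only non-default setting needed for the strong-consistency requirement is ConsistentRead on the read path:

    import boto3

    table = boto3.resource("dynamodb").Table("transactions")   # hypothetical table

    # Writes are synchronously replicated across multiple facilities, so the
    # data is durably stored at the transaction rates described above.
    table.put_item(Item={"transaction_id": "txn-0001",
                         "amount": "125.00", "status": "CAPTURED"})

    # Reads default to eventual consistency; opt in to strong consistency only
    # where required, since it consumes twice the read capacity.
    resp = table.get_item(Key={"transaction_id": "txn-0001"}, ConsistentRead=True)
    print(resp.get("Item"))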

Question#24

A company generates a large number of files each month and needs to use AWS Import/Export to move these files into Amazon S3 storage. To satisfy the auditors, the company needs to keep a record of which files were imported into Amazon S3.
What is a low-cost way to create a unique log for each import job?

  • A. Use the same log file prefix in the import/export manifest files to create a versioned log file in Amazon S3 for all imports.
  • B. Use the log file prefix in the import/export manifest files to create a unique log file in Amazon S3 for each import.
  • C. Use the log file checksum in the import/export manifest files to create a unique log file in Amazon S3 for each import.
  • D. Use a script to iterate over files in Amazon S3 to generate a log after each import/export job.
Answer: B
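
For illustration only, a Python helper that writes an import manifest whose log prefix is unique per job, so each import leaves its own log object in S3; the manifest field names are reproduced from memory and should be verified against the AWS Import/Export manifest reference before use:

    import datetime

    # Hypothetical job name and bucket; a unique log prefix per import job gives
    # the auditors one distinct log object in S3 for every job.
    job_name = "monthly-archive"
    stamp = datetime.date.today().isoformat()

    manifest_lines = [
        "manifestVersion: 2.0",
        "deviceId: ABC123",                       # hypothetical device identifier
        "bucket: example-import-bucket",          # destination bucket for the data
        "logBucket: example-import-bucket",
        f"logPrefix: import-logs/{job_name}-{stamp}/",   # unique per import job
        "eraseDevice: no",
        "returnAddress:",
        "  name: Jane Doe",
        "  street1: 123 Any Street",
        "  city: Anytown",
        "  stateOrProvince: WA",
        "  postalCode: '98101'",
        "  country: USA",
    ]

    with open("manifest.yml", "w") as fh:
        fh.write("\n".join(manifest_lines) + "\n")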

Question#25

A company needs a churn prevention model to predict which customers will NOT renew their yearly subscription to the company's service. The company plans to provide these customers with a promotional offer. A binary classification model that uses Amazon Machine Learning is required.
On which basis should this binary classification model be built?

  • A. User profiles (age, gender, income, occupation)
  • B. Last user session
  • C. Each user's time-series events from the past 3 months
  • D. Quarterly results
Answer: C
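
A rough data-preparation sketch in Python/pandas for option C: each user's event time series from the past three months is rolled up into per-user features and joined to a binary churn label before being handed to Amazon Machine Learning as a training datasource. File and column names are hypothetical:

    import pandas as pd

    # Hypothetical extracts: one row per user event, plus renewal outcomes.
    events = pd.read_csv("events_last_3_months.csv", parse_dates=["ts"])
    labels = pd.read_csv("renewals.csv")          # columns: user_id, renewed (0/1)

    # Roll each user's event time series up into per-user behavioral features.
    features = (events
                .assign(day=events["ts"].dt.date)
                .groupby("user_id")
                .agg(total_events=("event_type", "count"),
                     active_days=("day", "nunique"),
                     last_seen=("ts", "max"))
                .reset_index())
    features["days_since_last_seen"] = (events["ts"].max() - features["last_seen"]).dt.days

    # Attach the binary target and write the training file Amazon ML will read.
    dataset = features.drop(columns="last_seen").merge(labels, on="user_id")
    dataset["churned"] = 1 - dataset["renewed"]
    dataset.drop(columns="renewed").to_csv("churn_training.csv", index=False)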

Question#26

A company with a support organization needs support engineers to be able to search historical cases so they can respond quickly to new issues. The company has forwarded all support messages into an Amazon Kinesis stream. This meets a company objective of using only managed services to reduce operational overhead.
The company needs an appropriate architecture that allows support engineers to search historical cases and find similar issues and their associated responses.
Which AWS Lambda action is most appropriate?

  • A. Ingest and index the content into an Amazon Elasticsearch domain.
  • B. Stem and tokenize the input and store the results into Amazon ElastiCache.
  • C. Write data as JSON into Amazon DynamoDB with primary and secondary indexes.
  • D. Aggregate feedback in Amazon S3 using a columnar format with partitioning.
Answer: A
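
A minimal sketch of option A as a Lambda handler on the Kinesis stream, assuming a hypothetical Amazon Elasticsearch Service domain endpoint and index name; production code would sign requests with SigV4 (for example with requests-aws4auth) rather than relying on the open access policy assumed here:

    import base64
    import json
    import urllib.request

    # Hypothetical Amazon Elasticsearch Service domain endpoint and index name.
    ES_ENDPOINT = "https://search-support-cases-abc123.us-east-1.es.amazonaws.com"

    def handler(event, context):
        """Index each support message from the Kinesis stream into Elasticsearch."""
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"])
            doc = json.loads(payload)                      # support message as JSON
            doc_id = record["kinesis"]["sequenceNumber"]   # unique per record
            req = urllib.request.Request(
                url=f"{ES_ENDPOINT}/support-cases/_doc/{doc_id}",
                data=json.dumps(doc).encode(),
                headers={"Content-Type": "application/json"},
                method="PUT")
            urllib.request.urlopen(req)
        return {"indexed": len(event["Records"])}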

Question#27

A solutions architect works for a company that has a data lake based on a central Amazon S3 bucket. The data contains sensitive information. The architect must be able to specify exactly which files each user can access. Users access the platform through a SAML-federated single sign-on platform.
The architect needs to build a solution that allows fine-grained access control, traceability of access to the objects, and use of standard tools (AWS Management Console, AWS CLI) to access the data.
Which solution should the architect build?

  • A. Use Amazon S3 Server-Side Encryption with AWS KMS-Managed Keys for storing data. Use AWS KMS Grants to allow access to specific elements of the platform. Use AWS CloudTrail for auditing.
  • B. Use Amazon S3 Server-Side Encryption with Amazon S3-Managed Keys. Set Amazon S3 ACLs to allow access to specific elements of the platform. Use Amazon S3 to access logs for auditing.
  • C. Use Amazon S3 Client-Side Encryption with Client-Side Master Key. Set Amazon S3 ACLs to allow access to specific elements of the platform. Use Amazon S3 to access logs for auditing.
  • D. Use Amazon S3 Client-Side Encryption with AWS KMS-Managed Keys for storing data. Use AWS KMS Grants to allow access to specific elements of the platform. Use AWS CloudTrail for auditing.
Answer: D
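
A hedged sketch of the KMS-grant portion of option D, with hypothetical key and role ARNs; the grant's encryption-context constraint is one way to scope access to specific parts of the data lake, and CloudTrail records every use of the key and the grant:

    import boto3

    kms = boto3.client("kms")

    # Hypothetical CMK used for client-side encryption of data-lake objects.
    KEY_ID = "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"

    # Grant one federated user's role permission to use the key, but only for
    # objects encrypted with a matching encryption context (e.g. an S3 prefix).
    kms.create_grant(
        KeyId=KEY_ID,
        GranteePrincipal="arn:aws:iam::111122223333:role/SamlAnalystJane",
        Operations=["Decrypt", "GenerateDataKey"],
        Constraints={"EncryptionContextSubset": {"prefix": "finance/"}},
    )
    # Every use of the key under this grant is logged by AWS CloudTrail,
    # which covers the traceability requirement.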

Question#28

A company that provides economics data dashboards needs to be able to develop software to display rich, interactive, data-driven graphics that run in web browsers and leverage the full stack of web standards (HTML, SVG, and CSS).
Which technology provides the most appropriate support for these requirements?

  • A. D3.js
  • B. IPython/Jupyter
  • C. R Studio
  • D. Hue
Answer: A
Reference: https://sa.udacity.com/course/data-visualization-and-d3js--ud507

Question#29

A company hosts a portfolio of e-commerce websites across the Oregon, N. Virginia, Ireland, and Sydney
AWS regions. Each site keeps log files that capture user behavior. The company has built an application that generates batches of product recommendations with collaborative filtering in Oregon. Oregon was selected because the flagship site is hosted there and provides the largest collection of data to train machine learning models against. The other regions do NOT have enough historic data to train accurate machine learning models.
Which set of data processing steps improves recommendations for each region?

  • A. Use the e-commerce application in Oregon to write replica log files in each other region.
  • B. Use Amazon S3 bucket replication to consolidate log entries and build a single model in Oregon.
  • C. Use Kinesis as a buffer for web logs and replicate logs to the Kinesis stream of a neighboring region.
  • D. Use the CloudWatch Logs agent to consolidate logs into a single CloudWatch Logs group.
Answer: D
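
One possible downstream step for option D, sketched with boto3: the recommendation job in Oregon reads the single consolidated log group that the CloudWatch Logs agent on each site is assumed to write into. The log group name and region are hypothetical:

    import time
    import boto3

    # The CloudWatch Logs agent on every web server is assumed to ship its user
    # behavior logs into this single, hypothetical log group read from Oregon.
    logs = boto3.client("logs", region_name="us-west-2")

    def fetch_recent_clickstream(hours=24):
        """Yield the last day of consolidated log lines for model training."""
        end = int(time.time() * 1000)
        start = end - hours * 3600 * 1000
        paginator = logs.get_paginator("filter_log_events")
        for page in paginator.paginate(logGroupName="ecommerce/clickstream",
                                       startTime=start, endTime=end):
            for event in page["events"]:
                yield event["message"]

    for line in fetch_recent_clickstream():
        pass  # feed each log line into the collaborative-filtering training job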

Question#30

There are thousands of text files on Amazon S3. The total size of the files is 1 PB. The files contain retail order information for the past 2 years. A data engineer needs to run multiple interactive queries to manipulate the data. The data engineer has AWS access to spin up an Amazon EMR cluster, and needs to use an application on the cluster to process this data and return the results within an interactive time frame.
Which application on the cluster should the data engineer use?

  • A. Oozie
  • B. Apache Pig with Tachyon
  • C. Apache Hive
  • D. Presto
Answer: C
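
A sketch of interactive querying with the chosen application, assuming PyHive is installed on the client and HiveServer2 on the EMR master node is reachable; the hostname, S3 location, and table layout are hypothetical:

    from pyhive import hive   # assumes PyHive is available on the client machine

    # Connect to HiveServer2 on the EMR master node (hostname is hypothetical).
    conn = hive.Connection(host="ec2-198-51-100-10.compute-1.amazonaws.com",
                           port=10000, username="hadoop")
    cur = conn.cursor()

    # Define an external table directly over the text files in S3; nothing is
    # copied into the cluster, and partitioning by order date keeps interactive
    # queries from scanning the full petabyte.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS orders (
            order_id STRING, customer_id STRING, amount DOUBLE, order_ts STRING)
        PARTITIONED BY (order_date STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
        LOCATION 's3://example-retail-orders/raw/'
    """)
    cur.execute("MSCK REPAIR TABLE orders")  # discover existing partitions in S3

    cur.execute("""
        SELECT order_date, SUM(amount)
        FROM orders
        WHERE order_date BETWEEN '2016-01-01' AND '2016-03-31'
        GROUP BY order_date
    """)
    print(cur.fetchall())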
