Exam 18: Professional Data Engineer on Google Cloud Platform

arrow
  • Select Tags
search iconSearch Question
flashcardsStudy Flashcards
  • Select Tags

Your company built a TensorFlow neutral-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?

Free
(Multiple Choice)
4.7/5
(29)
Correct Answer:
Verified

C

You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. Which solution should you choose?

Free
(Multiple Choice)
4.9/5
(38)
Correct Answer:
Verified

B

You work for a bank. You have a labelled dataset that contains information on already granted loan application and whether these applications have been defaulted. You have been asked to train a model to predict default rates for credit applicants. What should you do?

Free
(Multiple Choice)
4.8/5
(34)
Correct Answer:
Verified

B

Your company's customer and order databases are often under heavy load. This makes performing analytics against them difficult without harming operations. The databases are in a MySQL cluster, with nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations. What should you do?

(Multiple Choice)
5.0/5
(32)

You are developing an application on Google Cloud that will automatically generate subject labels for users' blog posts. You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning. What should you do?

(Multiple Choice)
4.8/5
(27)

By default, which of the following windowing behavior does Dataflow apply to unbounded data sets?

(Multiple Choice)
4.8/5
(26)

You want to migrate an on-premises Hadoop system to Cloud Dataproc. Hive is the primary tool in use, and the data format is Optimized Row Columnar (ORC). All ORC files have been successfully copied to a Cloud Storage bucket. You need to replicate some data to the cluster's local Hadoop Distributed File System (HDFS) to maximize performance. What are two ways to start using Hive in Cloud Dataproc? (Choose two.)

(Multiple Choice)
4.7/5
(37)

Cloud Dataproc is a managed Apache Hadoop and Apache _____ service.

(Multiple Choice)
4.9/5
(35)

You work for an advertising company, and you've developed a Spark ML model to predict click-through rates at advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you've been using will be migrated to migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud. What should you do?

(Multiple Choice)
4.8/5
(33)

You have developed three data processing jobs. One executes a Cloud Dataflow pipeline that transforms data uploaded to Cloud Storage and writes results to BigQuery. The second ingests data from on-premises servers and uploads it to Cloud Storage. The third is a Cloud Dataflow pipeline that gets information from third-party data providers and uploads the information to Cloud Storage. You need to be able to schedule and monitor the execution of these three workflows and manually execute them when needed. What should you do?

(Multiple Choice)
4.9/5
(35)

Flowlogistic Case Study Company Overview Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping. Company Background The company started as a regional trucking company, and then expanded into other logistics market. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources. Solution Concept Flowlogistic wants to implement two concepts using the cloud: Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources, which markets to expand info. They also want to use predictive analytics to learn earlier when a shipment will be delayed. Existing Technical Environment Flowlogistic architecture resides in a single data center: Databases 8 physical servers in 2 clusters - SQL Server - user data, inventory, static data 3 physical servers - Cassandra - metadata, tracking messages 10 Kafka servers - tracking message aggregation and batch insert Application servers - customer front end, middleware for order/customs 60 virtual machines across 20 physical servers - Tomcat - Java services - Nginx - static content - Batch servers Storage appliances - iSCSI for virtual machine (VM) hosts - Fibre Channel storage area network (FC SAN) - SQL server storage - Network-attached storage (NAS) image storage, logs, backups 10 Apache Hadoop /Spark servers - Core Data Lake - Data analysis workloads 20 miscellaneous servers - Jenkins, monitoring, bastion hosts, Business Requirements Build a reliable and reproducible environment with scaled panty of production. Aggregate data in a centralized Data Lake for analysis Use historical data to perform predictive analytics on future shipments Accurately track every shipment worldwide using proprietary technology Improve business agility and speed of innovation through rapid provisioning of new resources Analyze and optimize architecture for performance in the cloud Migrate fully to the cloud if all other requirements are met Technical Requirements Handle both streaming and batch data Migrate existing Hadoop workloads Ensure architecture is scalable and elastic to meet the changing demands of the company. Use managed services whenever possible Encrypt data flight and at rest Connect a VPN between the production data center and cloud environment SEO Statement We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around. We need to organize our information so we can more easily understand where our customers are and what they are shipping. CTO Statement IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO' s tracking technology. CFO Statement Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where out shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment. Flowlogistic's CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they've purchased a visualization tool to simplify the creation of BigQuery reports. However, they've been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?

(Multiple Choice)
4.9/5
(27)

You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors. You operate small data centers around the world to capture these events, but leased lines that provide connectivity from your event collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most cost-effective way. What should you do?

(Multiple Choice)
4.7/5
(26)

Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery. Which three approaches can you take? (Choose three.)

(Multiple Choice)
4.9/5
(38)

You have a data pipeline that writes data to Cloud Bigtable using well-designed row keys. You want to monitor your pipeline to determine when to increase the size of you Cloud Bigtable cluster. Which two actions can you take to accomplish this? Choose 2 answers.

(Multiple Choice)
5.0/5
(40)

Why do you need to split a machine learning dataset into training data and test data?

(Multiple Choice)
4.9/5
(41)

Your company maintains a hybrid deployment with GCP, where analytics are performed on your anonymized customer data. The data are imported to Cloud Storage from your data center through parallel uploads to a data transfer server running on GCP. Management informs you that the daily transfers take too long and have asked you to fix the problem. You want to maximize transfer speeds. Which action should you take?

(Multiple Choice)
4.9/5
(29)

What are two methods that can be used to denormalize tables in BigQuery?

(Multiple Choice)
4.8/5
(24)

You want to process payment transactions in a point-of-sale application that will run on Google Cloud Platform. Your user base could grow exponentially, but you do not want to manage infrastructure scaling. Which Google database service should you use?

(Multiple Choice)
4.8/5
(40)

You need to create a near real-time inventory dashboard that reads the main inventory tables in your BigQuery data warehouse. Historical inventory data is stored as inventory balances by item and location. You have several thousand updates to inventory every hour. You want to maximize performance of the dashboard and ensure that the data is accurate. What should you do?

(Multiple Choice)
5.0/5
(39)

You need to choose a database for a new project that has the following requirements: Fully managed Able to automatically scale up Transactionally consistent Able to scale up to 6 TB Able to be queried using SQL Which database do you choose?

(Multiple Choice)
4.8/5
(37)
Showing 1 - 20 of 256
close modal

Filters

  • Essay(0)
  • Multiple Choice(0)
  • Short Answer(0)
  • True False(0)
  • Matching(0)