Question 1

Which methods can be used to reduce the number of rows processed by BigQuery?

Accepted Answer

A)  Splitting tables into multiple tables; putting data in partitions 
B)  Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause 
C)  Putting data in partitions; using the LIMIT clause 
D)  Splitting tables into multiple tables; using the LIMIT clause
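
To make the difference between these techniques concrete, here is a toy Python model (not BigQuery's execution engine). It assumes a hypothetical table split into three daily partitions, and it assumes LIMIT is applied only after the scan, which is how BigQuery treats LIMIT on a non-clustered table. All names and row counts are illustrative.

```python
from collections import defaultdict

# Toy table: 3 daily partitions with 1,000 rows each (hypothetical data).
partitions = defaultdict(list)
for day in ("2024-01-01", "2024-01-02", "2024-01-03"):
    for i in range(1000):
        partitions[day].append({"day": day, "value": i})


def scan_with_partition_filter(day):
    """Read only the partition that matches the filter."""
    rows = partitions[day]
    return len(rows), rows


def scan_with_limit(limit):
    """Read every partition, then keep only the first `limit` rows."""
    scanned = [row for rows in partitions.values() for row in rows]
    return len(scanned), scanned[:limit]


pruned_count, _ = scan_with_partition_filter("2024-01-02")
full_count, limited = scan_with_limit(10)
print(pruned_count)  # rows read when the partition filter prunes the scan
print(full_count)    # rows read when only LIMIT is applied
```

In this sketch the partition filter touches one third of the rows, while the LIMIT query still reads all of them and only trims the output.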

Question 2

MJTelco Case Study

Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.

Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network, allowing them to account for the impact of dynamic regional politics on location availability and cost. Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.

Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
- Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
- Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments (development/test, staging, and production) to meet the needs of running experiments, deploying new features, and serving production customers.

Business Requirements
- Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
- Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
- Provide reliable and timely access to data for analysis from distributed research workers.
- Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.

Technical Requirements
- Ensure secure and efficient transport and storage of telemetry data.
- Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
- Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
- Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.

CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.

CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.

CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.

MJTelco's Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?

Accepted Answer

A)  The zone 
B)  The number of workers 
C)  The disk size per worker 
D)  The maximum number of workers
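
As a rough illustration of how worker-scaling settings are passed to a pipeline, the sketch below builds the command-line flags that the Apache Beam Python SDK accepts for the Dataflow runner (`--num_workers`, `--max_num_workers`, `--autoscaling_algorithm`). The helper name `dataflow_autoscaling_args` and the worker counts are hypothetical, and the sketch only assembles the flags; it does not launch a job.

```python
def dataflow_autoscaling_args(max_workers, initial_workers=None):
    """Build command-line flags for a Beam Python pipeline on Dataflow."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    args = [
        "--runner=DataflowRunner",
        # Let the service add or remove workers based on throughput.
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        # Upper bound that autoscaling may grow to.
        f"--max_num_workers={max_workers}",
    ]
    if initial_workers is not None:
        # Starting worker count; autoscaling can move away from it.
        args.append(f"--num_workers={initial_workers}")
    return args


# Hypothetical values chosen only for illustration.
args = dataflow_autoscaling_args(max_workers=50, initial_workers=5)
print(args)
```

The starting worker count only fixes where the job begins; the ceiling is what bounds how far autoscaling can grow.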

Question 3

You operate a database that stores stock trades and an application that retrieves average stock price for a given company over an adjustable window of time. The data is stored in Cloud Bigtable where the datetime of the stock trade is the beginning of the row key. Your application has thousands of concurrent users, and you notice that performance is starting to degrade as more stocks are added. What should you do to improve the performance of your application?

Accepted Answer

A)  Change the row key syntax in your Cloud Bigtable table to begin with the stock symbol. 
B)  Change the row key syntax in your Cloud Bigtable table to begin with a random number per second. 
C)  Change the data pipeline to use BigQuery for storing stock trades, and update your application. 
D)  Use Cloud Dataflow to write summary of each day's stock trades to an Avro file on Cloud Storage. Update your application to read from Cloud Storage and Cloud Bigtable to compute the responses.
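
To show how row key order affects reads, here is a small sketch that uses a sorted Python list as a stand-in for Bigtable's lexicographically sorted rows. The `symbol#timestamp` key format, the `make_key` and `scan_range` helpers, and the sample trades are assumptions made for illustration. With the symbol first, one stock's time window is a single contiguous key range.

```python
import bisect


def make_key(symbol, timestamp):
    """Row key that starts with the stock symbol, then the trade time."""
    return f"{symbol}#{timestamp}"


# Hypothetical trades: (symbol, ISO 8601 timestamp).
trades = [
    ("GOOG", "2024-01-02T09:30:00"),
    ("AAPL", "2024-01-02T09:30:00"),
    ("GOOG", "2024-01-02T09:31:00"),
    ("AAPL", "2024-01-02T09:31:00"),
    ("GOOG", "2024-01-03T09:30:00"),
]

# Bigtable keeps rows sorted by key; a sorted list stands in for that here.
sorted_keys = sorted(make_key(s, t) for s, t in trades)


def scan_range(symbol, start, end):
    """Return keys for one symbol with start <= timestamp < end."""
    lo = bisect.bisect_left(sorted_keys, make_key(symbol, start))
    hi = bisect.bisect_left(sorted_keys, make_key(symbol, end))
    return sorted_keys[lo:hi]


goog_jan2 = scan_range("GOOG", "2024-01-02T00:00:00", "2024-01-03T00:00:00")
print(goog_jan2)
```

With a timestamp-first key, the same query would have to scan every symbol's rows in the window, and all new writes would land at the end of the key space.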

Question 4

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? (Choose two.)

Accepted Answer

A)  Denormalize the data as much as possible. 
B)  Preserve the structure of the data as much as possible. 
C)  Use BigQuery UPDATE to further reduce the size of the dataset. 
D)  Develop a data pipeline where status updates are appended to BigQuery instead of updated. 
E)  Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.
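
The append-instead-of-update pattern mentioned in the options can be sketched in plain Python. This is a minimal illustration, not a BigQuery client example: the `txn_id`, `status`, and `updated_at` field names and the two helper functions are hypothetical. Each status change is written as a new row, and the current status is resolved at read time.

```python
# Hypothetical append-only log: each status change is a new row.
status_log = [
    {"txn_id": "t1", "status": "PENDING", "updated_at": 1},
    {"txn_id": "t2", "status": "PENDING", "updated_at": 2},
    {"txn_id": "t1", "status": "SETTLED", "updated_at": 3},
]


def append_status(log, txn_id, status, updated_at):
    """Record a status change without touching earlier rows."""
    log.append({"txn_id": txn_id, "status": status, "updated_at": updated_at})


def latest_status(log):
    """Resolve the current status per transaction at read time."""
    latest = {}
    for row in log:
        current = latest.get(row["txn_id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["txn_id"]] = row
    return {txn: row["status"] for txn, row in latest.items()}


append_status(status_log, "t2", "FAILED", 4)
current = latest_status(status_log)
print(current)
```

The trade-off is that readers must pick the newest row per transaction, in exchange for never rewriting existing data.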

Question 5

You currently have a single on-premises Kafka cluster in a data center in the us-east region that is responsible for ingesting messages from IoT devices globally. Because large parts of the globe have poor internet connectivity, messages sometimes batch at the edge, come in all at once, and cause a spike in load on your Kafka cluster. This is becoming difficult to manage and prohibitively expensive. What is the Google-recommended cloud native architecture for this scenario?

Accepted Answer

A)  Edge TPUs as sensor devices for storing and transmitting the messages. 
B)  Cloud Dataflow connected to the Kafka cluster to scale the processing of incoming messages. 
C)  An IoT gateway connected to Cloud Pub/Sub, with Cloud Dataflow to read and process the messages from Cloud Pub/Sub. 
D)  A Kafka cluster virtualized on Compute Engine in us-east with Cloud Load Balancing to connect to the devices around the world.

Question 6

You work for a mid-sized enterprise that needs to move its operational system transaction data from an on-premises database to GCP. The database is about 20 TB in size. Which database should you choose?

Accepted Answer

A)  Cloud SQL 
B)  Cloud Bigtable 
C)  Cloud Spanner 
D)  Cloud Datastore

Question 7

All Google Cloud Bigtable client requests go through a front-end server ______ they are sent to a Cloud Bigtable node.

Accepted Answer

A)  before 
B)  after 
C)  only if 
D)  once

Question 8

Which software libraries are supported by Cloud Machine Learning Engine?

Accepted Answer

A)  Theano and TensorFlow 
B)  Theano and Torch 
C)  TensorFlow 
D)  TensorFlow and Torch

Question 9

You are developing a software application using Google's Dataflow SDK, and want to use conditionals, for loops, and other complex programming structures to create a branching pipeline. Which component will be used for the data processing operation?

Accepted Answer

A)  PCollection 
B)  Transform 
C)  Pipeline 
D)  Sink API

Question 10

You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?

Accepted Answer

A)  Eliminate features that are highly correlated to the output labels. 
B)  Combine highly co-dependent features into one representative feature. 
C)  Instead of feeding in each feature individually, average their values in batches of 3. 
D)  Remove the features that have null values for more than 50% of the training records.
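
One way to picture merging co-dependent features is the sketch below, which computes a Pearson correlation by hand and averages two features into one when they are almost perfectly correlated. The 0.95 threshold, the plain averaging, and the sample weather readings are assumptions for illustration; a real pipeline would usually scale the features first or use a technique such as PCA.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def combine_if_correlated(xs, ys, threshold=0.95):
    """Average two features into one when |r| >= threshold, else keep both."""
    if abs(pearson(xs, ys)) >= threshold:
        return [[(x + y) / 2 for x, y in zip(xs, ys)]]
    return [xs, ys]


# Hypothetical features: temperature at two nearby stations, plus humidity.
temp_a = [10.0, 12.0, 15.0, 11.0, 9.0]
temp_b = [10.5, 12.4, 15.2, 11.1, 9.3]
humidity = [80.0, 60.0, 75.0, 50.0, 90.0]

merged = combine_if_correlated(temp_a, temp_b)   # near-duplicates: one feature
kept = combine_if_correlated(temp_a, humidity)   # weakly related: both kept
print(len(merged), len(kept))
```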

Question 11

Flowlogistic Case Study

Company Overview
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.

Company Background
The company started as a regional trucking company, and then expanded into other logistics markets. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.

Solution Concept
Flowlogistic wants to implement two concepts using the cloud:
- Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads.
- Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.

Existing Technical Environment
Flowlogistic architecture resides in a single data center:
- Databases
    - 8 physical servers in 2 clusters
        - SQL Server: user data, inventory, static data
    - 3 physical servers
        - Cassandra: metadata, tracking messages
    - 10 Kafka servers: tracking message aggregation and batch insert
- Application servers: customer front end, middleware for order/customs
    - 60 virtual machines across 20 physical servers
        - Tomcat: Java services
        - Nginx: static content
        - Batch servers
- Storage appliances
    - iSCSI for virtual machine (VM) hosts
    - Fibre Channel storage area network (FC SAN): SQL Server storage
    - Network-attached storage (NAS): image storage, logs, backups
- 10 Apache Hadoop/Spark servers
    - Core Data Lake
    - Data analysis workloads
- 20 miscellaneous servers
    - Jenkins, monitoring, bastion hosts

Business Requirements
- Build a reliable and reproducible environment with scaled parity of production.
- Aggregate data in a centralized Data Lake for analysis.
- Use historical data to perform predictive analytics on future shipments.
- Accurately track every shipment worldwide using proprietary technology.
- Improve business agility and speed of innovation through rapid provisioning of new resources.
- Analyze and optimize architecture for performance in the cloud.
- Migrate fully to the cloud if all other requirements are met.

Technical Requirements
- Handle both streaming and batch data.
- Migrate existing Hadoop workloads.
- Ensure architecture is scalable and elastic to meet the changing demands of the company.
- Use managed services whenever possible.
- Encrypt data in flight and at rest.
- Connect a VPN between the production data center and cloud environment.

CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around. We need to organize our information so we can more easily understand where our customers are and what they are shipping.

CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.

CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

Accepted Answer

A)  Store the common data in BigQuery as partitioned tables. 
B)  Store the common data in BigQuery and expose authorized views. 
C)  Store the common data encoded as Avro in Google Cloud Storage. 
D)  Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.

Question 12

You work for a global shipping company. You want to train a model on 40 TB of data to predict which ships in each geographic region are likely to cause delivery delays on any given day. The model will be based on multiple attributes collected from multiple sources. Telemetry data, including location in GeoJSON format, will be pulled from each ship and loaded every hour. You want to have a dashboard that shows how many and which ships are likely to cause delays within a region. You want to use a storage solution that has native functionality for prediction and geospatial processing. Which storage solution should you use?

Accepted Answer

A)  BigQuery 
B)  Cloud Bigtable 
C)  Cloud Datastore 
D)  Cloud SQL for PostgreSQL

Question 13

When creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required: project, region, name, and ____.

Accepted Answer

A)  zone 
B)  node 
C)  label 
D)  type

Question 14

You are developing an application that uses a recommendation engine on Google Cloud. Your solution should display new videos to customers based on past views. Your solution needs to generate labels for the entities in videos that the customer has viewed. Your design must be able to provide very fast filtering suggestions based on data from other customer preferences on several TB of data. What should you do?

Accepted Answer

A)  Build and train a complex classification model with Spark MLlib to generate labels and filter the results. Deploy the models using Cloud Dataproc. Call the model from your application. 
B)  Build and train a classification model with Spark MLlib to generate labels. Build and train a second classification model with Spark MLlib to filter results to match customer preferences. Deploy the models using Cloud Dataproc. Call the models from your application. 
C)  Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud Bigtable, and filter the predicted labels to match the user's viewing history to generate preferences. 
D)  Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud SQL, and join and filter the predicted labels to match the user's viewing history to generate preferences.

Question 15

Suppose you have a table that includes a nested column called "city" inside a column called "person", but when you try to submit the following query in BigQuery, it gives you an error.

    SELECT person FROM `project1.example.table1` WHERE city = "London"

How would you correct the error?

Accepted Answer

A)  Add ", UNNEST(person)" before the WHERE clause. 
B)  Change "person" to "person.city". 
C)  Change "person" to "city.person". 
D)  Add ", UNNEST(city)" before the WHERE clause.
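
To illustrate what unnesting does, the plain-Python sketch below assumes that "person" is a repeated record (an array of structs) whose elements carry a "city" field; the sample rows and the `unnest` helper are hypothetical. Pairing each row with each element of its repeated column is what makes the nested "city" field visible to the filter.

```python
# Hypothetical rows: "person" is a repeated record with a nested "city" field.
table1 = [
    {"id": 1, "person": [{"name": "Ann", "city": "London"},
                         {"name": "Bob", "city": "Paris"}]},
    {"id": 2, "person": [{"name": "Cy", "city": "Berlin"}]},
]


def unnest(rows, column):
    """Yield one (row, element) pair per element of the repeated column."""
    for row in rows:
        for element in row[column]:
            yield row, element


# Rough analogue of:
#   SELECT person FROM table1, UNNEST(person) WHERE city = "London"
# "city" is read from the unnested element; "person" is the row's full column.
matches = [row["person"] for row, element in unnest(table1, "person")
           if element["city"] == "London"]
print(matches)
```

Without the unnesting step there is no top-level "city" name for the WHERE clause to resolve, which is the error the question describes.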

Question 16

You are building an application to share financial market data with consumers, who will receive data feeds. Data is collected from the markets in real time. Consumers will receive the data in the following ways:
- Real-time event stream
- ANSI SQL access to real-time stream and historical data
- Batch historical exports
Which solution should you use?

Accepted Answer

A)  Cloud Dataflow, Cloud SQL, Cloud Spanner 
B)  Cloud Pub/Sub, Cloud Storage, BigQuery 
C)  Cloud Dataproc, Cloud Dataflow, BigQuery 
D)  Cloud Pub/Sub, Cloud Dataproc, Cloud SQL

Question 17

You are building a new data pipeline to share data between two different types of applications: jobs generators and job runners. Your solution must scale to accommodate increases in usage and must accommodate the addition of new applications without negatively affecting the performance of existing ones. What should you do?

Accepted Answer

A)  Create an API using App Engine to receive and send messages to the applications 
B)  Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them 
C)  Create a table on Cloud SQL, and insert and delete rows with the job information 
D)  Create a table on Cloud Spanner, and insert and delete rows with the job information
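
The decoupling that a topic with subscriptions provides can be sketched with a toy in-memory model. This is not the Cloud Pub/Sub client library; the `Topic` class, the subscription names, and the job payloads are hypothetical. The point is that a new consumer can be attached without changing the publisher or the existing consumers.

```python
from collections import deque


class Topic:
    """Toy in-memory topic: each subscription gets its own copy of a message."""

    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        """Create a subscription and return its message queue."""
        self.subscriptions[name] = deque()
        return self.subscriptions[name]

    def publish(self, message):
        """Deliver the message to every current subscription."""
        for queue in self.subscriptions.values():
            queue.append(message)


jobs = Topic()
runner_queue = jobs.subscribe("job-runners")
jobs.publish({"job_id": 1})

# A new application is added later without changing the publisher.
audit_queue = jobs.subscribe("audit")
jobs.publish({"job_id": 2})

print(list(runner_queue))
print(list(audit_queue))
```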

Question 18

You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?

Accepted Answer

A)  Add a SideInput that returns a Boolean if the element is corrupt. 
B)  Add a ParDo transform in Cloud Dataflow to discard corrupt elements. 
C)  Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data. 
D)  Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
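
A ParDo applies a function to each element and may emit zero, one, or many outputs, which is what makes it suitable for dropping bad records. The sketch below is a plain-Python analogue rather than Apache Beam code: `par_do` stands in for the transform, `discard_corrupt` for the per-element function, and the rule that a valid record is a JSON object with a `device_id` field is an assumption made for illustration.

```python
import json


def discard_corrupt(element):
    """Per-element function: yield the parsed record, or nothing if corrupt."""
    try:
        record = json.loads(element)
    except (TypeError, ValueError):
        return  # unparseable payload: emit no output
    if not isinstance(record, dict) or "device_id" not in record:
        return  # missing required field: emit no output
    yield record


def par_do(elements, fn):
    """Apply fn to each element and flatten whatever it yields."""
    for element in elements:
        yield from fn(element)


messages = ['{"device_id": "a", "temp": 21.5}', "not json", '{"temp": 3}']
clean = list(par_do(messages, discard_corrupt))
print(clean)
```

In Beam itself the same logic would live in a `DoFn` subclass (or a filter function), with the element-wise behaviour unchanged.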

Question 19

You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query:

    SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country

You check the query plan for the query and see the following output in the Read section of Stage:1:

(query plan image not available)

What is the most likely cause of the delay for this query?

Accepted Answer

A)  Users are running too many concurrent queries in the system 
B)  The [myproject:mydataset.mytable] table has too many partitions 
C)  Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values 
D)  Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew
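
Data skew on a grouping key can be checked by measuring how much of the table falls under the most common key value. The sketch below is a minimal plain-Python illustration; the `largest_key_share` helper and the 98-out-of-100 sample are hypothetical. When one value dominates, the worker that handles that group receives most of the rows.

```python
from collections import Counter


def largest_key_share(rows, key):
    """Return the most common key value and the fraction of rows carrying it."""
    counts = Counter(row[key] for row in rows)
    value, count = counts.most_common(1)[0]
    return value, count / len(rows)


# Hypothetical table: 98 of 100 rows share one country.
rows = [{"country": "US"}] * 98 + [{"country": "FR"}, {"country": "JP"}]
value, share = largest_key_share(rows, "country")
print(value, share)
```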

Question 20

You need to create a new transaction table in Cloud Spanner that stores product sales data. You are deciding what to use as a primary key. From a performance perspective, which strategy should you choose?

Accepted Answer

A)  The current epoch time 
B)  A concatenation of the product name and the current epoch time 
C)  A random universally unique identifier number (version 4 UUID) 
D)  The original order identification number from the sales system, which is a monotonically increasing integer
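
To see why key choice matters for write distribution, the toy model below divides the key space into 16 equal ranges by the leading 4 bits of each key and counts how many ranges receive writes. The 16-way split and the `split_for` helper are assumptions for illustration (Spanner chooses its own split points dynamically), and the UUIDs come from a seeded generator only so the run is repeatable.

```python
import random
import uuid

SPLIT_BITS = 4  # hypothetical: key space divided into 2**4 = 16 equal ranges


def split_for(key_int, key_bits):
    """Map a key to the range that holds it, using its leading SPLIT_BITS bits."""
    return key_int >> (key_bits - SPLIT_BITS)


# Monotonically increasing 64-bit order IDs: new rows cluster in one range.
sequential_ids = range(1_000_000, 1_001_000)
sequential_splits = {split_for(i, 64) for i in sequential_ids}

# Version 4 UUIDs built from a seeded generator so the run is repeatable.
rng = random.Random(0)
uuid_keys = [uuid.UUID(int=rng.getrandbits(128), version=4) for _ in range(1000)]
uuid_splits = {split_for(u.int, 128) for u in uuid_keys}

print(len(sequential_splits), len(uuid_splits))
```

In this model the sequential IDs all land in a single range, while the random UUIDs spread across many ranges; epoch timestamps behave like the sequential IDs because they also increase monotonically.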

Exam 18: Professional Data Engineer on Google Cloud Platform

Which methods can be used to reduce the number of rows processed by BigQuery?

You work for a mid-sized enterprise that needs to move its operational system transaction data from an on-premises database to GCP. The database is about 20 TB in size. Which database should you choose?

All Google Cloud Bigtable client requests go through a front-end server ______ they are sent to a Cloud Bigtable node.

Which software libraries are supported by Cloud Machine Learning Engine?

You are developing a software application using Google's Dataflow SDK, and want to use conditional, for loops and other complex programming structures to create a branching pipeline. Which component will be used for the data processing operation?

You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?

When creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required: project, region, name, and ____.

Suppose you have a table that includes a nested column called "city" inside a column called "person", but when you try to submit the following query in BigQuery, it gives you an error. SELECT person FROM `project1.example.table1` WHERE city = "London" How would you correct the error?

You need to create a new transaction table in Cloud Spanner that stores product sales data. You are deciding what to use as a primary key. From a performance perspective, which strategy should you choose?

Google AdWords: Display Advertising

Google AdWords Fundamentals

Associate Android Developer

Associate Cloud Engineer

Cloud Digital Leader

Google Analytics Individual Qualification (IQ)

Google Analytics Individual Qualification

GSuite

Looker Business Analyst

LookML Developer

Mobile Web Specialist

Professional Cloud Architect on Google Cloud Platform

Professional Cloud Developer

Professional Cloud DevOps Engineer

Professional Cloud Network Engineer

Professional Cloud Security Engineer

Professional Collaboration Engineer

Professional Machine Learning Engineer

Filters