Google Professional Cloud Data Engineer PR000111 Practice Exam Part 1 – gcp-examquestions
Notes: Hi all, this Google Professional Cloud Data Engineer Practice Exam will familiarize you with the types of questions you may encounter on the certification exam and help you determine your readiness, or whether you need more preparation and/or experience. Successful completion of the practice exam does not guarantee you will pass the certification exam, as the actual exam is longer and covers a wider range of topics. We highly recommend taking the Google Professional Cloud Data Engineer Guarantee Part, which includes real questions with the highlighted answers we collected from the exam; it will help you pass the exam more easily.
For PDF Version: https://gcp-examquestions.com/gcp-pro-data-engineer-practice/
Part 1: gcp-pro-data-engineer-practice-exam-part-1
Part 2: gcp-pro-data-engineer-practice-exam-part-2
Part 3: gcp-pro-data-engineer-practice-exam-part-3
You are building storage for files for a data pipeline on Google Cloud. You want to support JSON files. The schema of these files will occasionally change. Your analyst teams will run aggregate ANSI SQL queries on this data. What should you do?
Answers: B is correct because of the requirement to support occasionally changing JSON schemas and aggregate ANSI SQL queries: you need to use BigQuery, and it is quickest to use 'Automatically detect' for schema changes.
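Schema auto-detection works by sampling the input and inferring column names and types from the values it sees. The sketch below is a toy illustration of that idea over newline-delimited JSON, not BigQuery's actual algorithm; the function name and the INTEGER-to-FLOAT widening rule are assumptions for illustration only.

```python
import json

def infer_schema(json_lines, sample_size=100):
    """Infer a flat column -> type mapping by sampling JSON records,
    loosely mimicking BigQuery's schema auto-detection."""
    schema = {}
    for line in json_lines[:sample_size]:
        record = json.loads(line)
        for key, value in record.items():
            # Check bool before int: bool is a subclass of int in Python.
            if isinstance(value, bool):
                inferred = "BOOLEAN"
            elif isinstance(value, int):
                inferred = "INTEGER"
            elif isinstance(value, float):
                inferred = "FLOAT"
            else:
                inferred = "STRING"
            # Widen INTEGER to FLOAT if both appear for the same field.
            if schema.get(key) == "INTEGER" and inferred == "FLOAT":
                schema[key] = "FLOAT"
            elif key not in schema:
                schema[key] = inferred
    return schema

rows = ['{"name": "a", "price": 1}', '{"name": "b", "price": 2.5, "tag": "x"}']
print(infer_schema(rows))  # → {'name': 'STRING', 'price': 'FLOAT', 'tag': 'STRING'}
```

Because detection only samples the input, a field that changes type later in the file can still cause load errors, which is why auto-detect suits occasionally changing schemas rather than wildly inconsistent ones.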
You use a Hadoop cluster both for serving analytics and for processing and transforming data. The data is currently stored on HDFS in Parquet format. The data processing jobs run for 6 hours each night. Analytics users can access the system 24 hours a day. Phase 1 is to quickly migrate the entire Hadoop environment without a major re-architecture. Phase 2 will include migrating to BigQuery for analytics and to Cloud Dataflow for data processing. You want to make the future migration to BigQuery and Cloud Dataflow easier by following Google-recommended practices and managed services. What should you do?
Answers: D is correct because it leverages a managed service (Cloud Dataproc), the data is stored on GCS in Parquet format (which can easily be loaded into BigQuery in the future), and the Cloud Dataproc clusters are job-specific.
You are building a new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?
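The usual BigQuery pattern for this scenario is `ROW_NUMBER() OVER (PARTITION BY unique_id ORDER BY event_timestamp DESC)` and keeping only row number 1. A minimal Python sketch of that same keep-the-latest-row-per-ID logic, with illustrative field names:

```python
def dedupe_latest(rows):
    """Keep one row per unique ID, preferring the latest event timestamp --
    the same effect as filtering on
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) = 1 in BigQuery."""
    latest = {}
    for row in rows:
        uid, ts = row["id"], row["ts"]
        if uid not in latest or ts > latest[uid]["ts"]:
            latest[uid] = row
    return list(latest.values())

events = [
    {"id": "u1", "ts": 1, "page": "/a"},
    {"id": "u1", "ts": 2, "page": "/b"},  # duplicate send, later timestamp
    {"id": "u2", "ts": 1, "page": "/c"},
]
print(dedupe_latest(events))  # one row per ID, latest ts wins
```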
You are designing a streaming pipeline for ingesting player interaction data for a mobile game. You want the pipeline to handle out-of-order data delayed up to 15 minutes on a per-player basis and exponential growth in global users. What should you do?
Answers: A is correct because the question requires that delays be handled on a per-player basis, and session windowing will do that. Pub/Sub handles the need to scale exponentially with traffic coming from around the globe.
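Session windows group a key's events so that a new window starts whenever the gap since the previous event exceeds the gap duration. The sketch below is a simplified batch model of that idea keyed by player; real Beam/Dataflow windowing also merges late-arriving data within the allowed lateness, which this toy version only approximates by sorting.

```python
from collections import defaultdict

def session_windows(events, gap_minutes=15):
    """Group (player_id, timestamp_in_minutes) events into per-player
    sessions: a new session starts when the gap since the previous event
    exceeds gap_minutes -- a simplified model of Beam session windowing."""
    by_player = defaultdict(list)
    for player, ts in events:
        by_player[player].append(ts)

    sessions = defaultdict(list)
    for player, stamps in by_player.items():
        stamps.sort()  # out-of-order events land in the right session
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] > gap_minutes:
                sessions[player].append(current)  # gap exceeded: close session
                current = [ts]
            else:
                current.append(ts)
        sessions[player].append(current)
    return dict(sessions)

events = [("p1", 0), ("p1", 40), ("p1", 10), ("p2", 5)]
print(session_windows(events))  # → {'p1': [[0, 10], [40]], 'p2': [[5]]}
```

Because windows are computed per player key, one player's long delay never shifts another player's session boundaries, which is what the per-player requirement demands.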
Your company is loading comma-separated values (CSV) files into Google BigQuery. All data imports successfully; however, the imported data does not match the source file byte-for-byte. What is the most likely cause of this problem?
Answers: C is correct because this is the only situation that would cause successful import.
Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?
Answers: D is correct because it uses managed services, and also allows for the data to persist on GCS beyond the life of the cluster.
You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?
Answers: B is correct because regional storage is cheaper than BigQuery storage.
You have 250,000 devices which produce a JSON device status event every 10 seconds. You want to capture this event data for outlier time series analysis. What should you do?
Answers: C is correct because the data type, volume, and query pattern best fit Bigtable's capabilities, and also match the Google best practices linked below.
You are selecting a messaging service for log messages that must include final result message ordering as part of building a data pipeline on Google Cloud. You want to stream input for 5 days and be able to query the current status. You will be storing the data in a searchable repository. How should you set up the input messages?
Answers: A is correct because of recommended Google practices; see the links below.
You want to publish system metrics to Google Cloud from a large number of on-prem hypervisors and VMs for analysis and creation of dashboards. You have an existing custom monitoring agent deployed to all the hypervisors and your on-prem metrics system is unable to handle the load. You want to design a system that can collect and store metrics at scale. You don't want to manage your own time series database. Metrics from all agents should be written to the same table but agents must not have permission to modify or read data written by other agents. What should you do?
Answers: A is correct because Bigtable can store and analyze time series data, and the solution uses managed services, which is what the requirements call for.
You are designing storage for CSV files and using an I/O-intensive custom Apache Spark transform as part of deploying a data pipeline on Google Cloud. You intend to use ANSI SQL to run queries for your analysts. How should you transform the input data?
Answers: B is correct because of the requirement to use custom Spark transforms; use Cloud Dataproc. ANSI SQL queries require the use of BigQuery.
You are designing a relational data repository on Google Cloud to grow as needed. The data will be transactionally consistent and added from any location in the world. You want to monitor and adjust node count for input traffic, which can spike unpredictably. What should you do?
Answers: B is correct because of the requirement for globally scalable transactions—use Cloud Spanner. CPU utilization is the recommended metric for scaling, per Google best practices, linked below.
You have a Spark application that writes data to Cloud Storage in Parquet format. You scheduled the application to run daily using DataProcSparkOperator in an Apache Airflow DAG on Cloud Composer. You want to add tasks to the DAG to make the data available to BigQuery users. You want to maximize query speed and configure partitioning and clustering on the table. What should you do?
Answers: C is correct because it loads the data and sets partitioning and clustering.
You have a website that tracks page visits for each user and then creates a Cloud Pub/Sub message with the session ID and URL of the page. You want to create a Cloud Dataflow pipeline that sums the total number of pages visited by each user and writes the result to BigQuery. User sessions timeout after 30 minutes. Which type of Cloud Dataflow window should you choose?
Answers: C is correct because it continues to sum user page visits during their browsing session and completes at the same time as the session timeout.
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules: (a) no interaction by the user on the site for 1 hour; (b) more than $30 worth of products added to the basket; (c) no completed transaction. You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
Answers: C is correct because it will send a message per user after that user is inactive for 60 minutes.
You need to stream time-series data in Avro format, and then write this to both BigQuery and Cloud Bigtable simultaneously using Cloud Dataflow. You want to achieve minimal end-to-end latency. Your business requirements state this needs to be completed as quickly as possible. What should you do?
Answers: C is correct because this is the right set of transformations that accepts and writes to the required data stores.
Your company’s on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration. What should you do?
Answers: A is correct because Google recommends using Google Cloud Storage instead of HDFS, as it is much more cost-effective, especially when jobs aren’t running.
You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud. You want to support transactions that scale horizontally. You also want to optimize data for range queries on non-key columns. What should you do?
Answers: C is correct because Cloud Spanner scales horizontally, and you can create secondary indexes for the range queries that are required.
Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?
Answers: D is correct because it will allow for retrieval of data based on both sensor id and timestamp but without causing hotspotting.
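Bigtable sorts rows lexicographically by key, so a key that starts with a monotonically increasing timestamp funnels every new write to the same tablet (hotspotting). Putting the sensor ID first spreads writes across sensors while keeping each sensor's readings contiguous for dashboard range scans. A small sketch of the key construction (the `#` separator and zero-padding width are illustrative conventions, not requirements):

```python
def row_key(sensor_id, ts):
    """Build a Bigtable-style row key with the sensor ID as the prefix
    so writes spread across sensors instead of hotspotting on the
    monotonically increasing timestamp. Zero-pad the timestamp so
    lexicographic order matches numeric order."""
    return f"{sensor_id}#{ts:010d}"

keys = sorted(row_key(s, t) for s, t in
              [("sensorB", 1700000001), ("sensorA", 1700000002),
               ("sensorA", 1700000001)])
# Each sensor's readings are contiguous, so a dashboard can range-scan
# one sensor's data with a single prefix scan.
print(keys)
```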
You are developing an application on Google Cloud that will automatically generate subject labels for users’ blog posts. You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning. What should you do?
Answers: A is correct because it provides a managed service and a fully trained model, and pulling the entities yields the right labels.
Your company is using WILDCARD tables to query data across multiple tables with similar names. The SQL statement is currently failing with the error shown below. Which table name will make the SQL statement work correctly?
# Syntax error: Expected end of statement but got "-" at [4:11]
age != 99
AND _TABLE_SUFFIX = '1929'
Answers: D is correct because it follows the correct wildcard syntax of enclosing the table name in backticks and including the * wildcard character.
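A wildcard query such as `` SELECT ... FROM `project.dataset.gsod*` WHERE _TABLE_SUFFIX = '1929' `` resolves the `*` against every table whose name starts with the prefix, and `_TABLE_SUFFIX` is whatever the `*` matched. The toy model below imitates that resolution over a list of table names; it is an illustration of the semantics, not BigQuery's implementation.

```python
def match_wildcard(tables, prefix, suffix_filter=None):
    """Model BigQuery wildcard-table resolution: `prefix*` matches every
    table starting with the prefix, and _TABLE_SUFFIX is whatever the *
    matched, optionally restricted by a WHERE _TABLE_SUFFIX filter."""
    matched = {}
    for name in tables:
        if name.startswith(prefix):
            table_suffix = name[len(prefix):]
            if suffix_filter is None or table_suffix == suffix_filter:
                matched[name] = table_suffix
    return matched

tables = ["gsod1929", "gsod1930", "other1929"]
print(match_wildcard(tables, "gsod", suffix_filter="1929"))  # → {'gsod1929': '1929'}
```

The backticks in the real query matter because a table name containing `*` (or other special characters) is not valid unquoted, which is what produced the syntax error above.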
You are working on an ML-based application that will transcribe conversations between manufacturing workers. These conversations are in English and between 30 and 40 seconds long. Conversation recordings come from old enterprise radio sets that have a low sampling rate of 8000 Hz, but you have a large dataset of these recorded conversations with their transcriptions. You want to follow Google-recommended practices. How should you proceed with building your application?
Answers: A is correct because synchronous mode is recommended for short audio files.
You are developing an application on Google Cloud that will label famous landmarks in users’ photos. You are under competitive pressure to develop a predictive model quickly. You need to keep service costs low. What should you do?
Answers: B is correct because of the requirement to quickly develop a model that generates landmark labels from photos. This is supported in Cloud Vision API; see the link below.
You are building a data pipeline on Google Cloud. You need to select services that will host a deep neural network machine-learning model also hosted on Google Cloud. You also need to monitor and run jobs that could occasionally fail. What should you do?
Answers: B is correct because of the requirement to host an ML DNN and Google-recommended monitoring object (Jobs); see the links below.
You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split your dataset into training and test samples (in a 90/10 ratio). After you have trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model?
Answers: D is correct since increasing model complexity generally helps when you have an underfitting problem.
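RMSE is the square root of the mean squared residual, and comparing it between the training and test sets is how under- versus overfitting is diagnosed: a model that cannot even fit its own training data is underfitting. A quick sketch of the computation with made-up numbers:

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-squared error: sqrt of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

# Illustrative values only: train error well above test error suggests
# the model is underfitting, so increasing model complexity is the
# usual remedy (the scenario in the question above).
train_rmse = rmse([1.0, 2.0, 3.0], [1.5, 2.5, 2.5])  # 0.5
test_rmse = rmse([1.0, 2.0], [1.1, 2.1])             # 0.1
print(train_rmse, test_rmse)
```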
You are using Cloud Pub/Sub to stream inventory updates from many point-of-sale (POS) terminals into BigQuery. Each update event has the following information: product identifier "prodSku", change increment "quantityDelta", POS identification "termId", and "messageId", which is created for each push attempt from the terminal. During a network outage, you discovered that duplicated messages were sent, causing the inventory system to over-count the changes. You determine that the terminal application has design problems and may send the same event more than once during push retries. You want to ensure that the inventory update is accurate. What should you do?
Answers: D is correct because the client application must include a unique identifier to disambiguate possible duplicates due to push retries.
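Because "messageId" is generated per push attempt, it changes on every retry and cannot be used to deduplicate; the terminal must attach its own stable identifier to each event. A sketch of an idempotent subscriber keyed on a hypothetical client-supplied "eventId" field (the field name is an assumption for illustration):

```python
def apply_updates(updates):
    """Apply inventory deltas at most once per client-supplied event ID.
    Pub/Sub's messageId changes on each push retry, so the terminal must
    attach its own stable identifier (here a hypothetical "eventId")."""
    inventory = {}
    seen = set()
    for msg in updates:
        if msg["eventId"] in seen:
            continue  # duplicate delivery of the same terminal event
        seen.add(msg["eventId"])
        sku = msg["prodSku"]
        inventory[sku] = inventory.get(sku, 0) + msg["quantityDelta"]
    return inventory

updates = [
    {"eventId": "t1-001", "prodSku": "sku1", "quantityDelta": 5},
    {"eventId": "t1-001", "prodSku": "sku1", "quantityDelta": 5},  # retry
    {"eventId": "t1-002", "prodSku": "sku1", "quantityDelta": -2},
]
print(apply_updates(updates))  # → {'sku1': 3}
```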
You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database table must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?
Answers: C is correct because this option provides the least amount of inconvenience over using pre-specified date ranges or one table per clinic while also increasing performance due to avoiding self-joins.
Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have the freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you do first?
Answers: A is correct because this is the best way to get granular access to data showing which users are accessing which data.
You created a job which runs daily to import highly sensitive data from an on-premises location to Cloud Storage. You also set up a streaming data insert into Cloud Storage via a Kafka node that is running on a Compute Engine instance. You need to encrypt the data at rest and supply your own encryption key. Your key should not be stored in the Google Cloud. What should you do?
Answers: D is correct because the scenario requires you to use your own key and not store it on Compute Engine; this is also a Google-recommended practice.
You are working on a project with two compliance requirements. The first requirement states that your developers should be able to see the Google Cloud Platform billing charges for only their own projects. The second requirement states that your finance team members can set budgets and view the current charges for all projects in the organization. The finance team should not be able to view the project contents. You want to set permissions. What should you do?
Answers: B is correct because it uses the principle of least privilege for IAM roles; use the Billing Administrator IAM role for that job function.
Suppose you have a table that includes a nested column called "city" inside a column called "person", but when you try to submit the following query in BigQuery, it gives you an error.
SELECT person FROM `project1.example.table1` WHERE city = "London"
How would you fix the error?
What are two of the benefits of using denormalized data structures in BigQuery?
Which of these statements about exporting data from BigQuery is false?
What are all of the BigQuery operations that Google charges for?
Which of the following is not possible using primitive roles?
Which of these statements about BigQuery caching is true?
Which of these sources can you not load data into BigQuery from?
Which of the following statements about Legacy SQL and Standard SQL is not true?
How would you query specific partitions in a BigQuery table?
Which SQL keyword can be used to reduce the number of columns processed by BigQuery?