
[May-2024] Latest Google Professional-Data-Engineer exam dumps and online Test Engine
Google Professional-Data-Engineer: Selling Google Cloud Certified Products and Solutions
The Google Professional-Data-Engineer exam is intended for data engineers, data analysts, and other professionals who work with large data sets and need to design and implement scalable, reliable, and efficient data processing systems. It is also suitable for IT professionals who are responsible for managing data pipelines and ensuring the security, privacy, and compliance of data on Google Cloud Platform.
NEW QUESTION # 18
Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?
- A. Cloud Composer
- B. Cloud Dataprep
- C. Cloud Dataflow
- D. Cloud Dataproc
Answer: A
Explanation:
Cloud Composer is a managed service built on Apache Airflow, an open-source workflow orchestrator, so it can coordinate jobs that span multiple cloud providers.
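What "orchestrating" means here can be sketched without Airflow itself. The snippet below is only an illustration, assuming a made-up five-step pipeline (all task names are hypothetical); it uses Python's standard `graphlib` module to compute a run order in which each task follows the tasks it depends on, which is the core idea behind a DAG-based orchestrator.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_from_other_cloud": set(),
    "load_to_gcs": {"extract_from_other_cloud"},
    "transform_with_dataflow": {"load_to_gcs"},
    "load_to_bigquery": {"transform_with_dataflow"},
    "notify_team": {"load_to_bigquery"},
}


def execution_order(dag):
    """Return a run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(step, task)
```

A real Composer environment would express the same dependencies as an Airflow DAG and add scheduling, retries, and monitoring on top.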
NEW QUESTION # 19
If you want to create a machine learning model that predicts the price of a particular stock based on its recent price history, what type of estimator should you use?
- A. Classifier
- B. Unsupervised learning
- C. Clustering estimator
- D. Regressor
Answer: D
Explanation:
Regression is the supervised learning task for modeling and predicting continuous, numeric variables.
Examples include predicting real-estate prices, stock price movements, or student test scores.
Classification is the supervised learning task for modeling and predicting categorical variables. Examples include predicting employee churn, email spam, financial fraud, or student letter grades.
Clustering is an unsupervised learning task for finding natural groupings of observations (i.e. clusters) based on the inherent structure within your dataset. Examples include customer segmentation, grouping similar items in e-commerce, and social network analysis.
Reference: https://elitedatascience.com/machine-learning-algorithms
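To make the regression case concrete, here is a minimal sketch, assuming the "recent price history" is just a list of closing prices at equally spaced time steps (the sample prices are invented). It fits a straight line by ordinary least squares and extrapolates one step ahead, so the output is a continuous number rather than a category.

```python
def fit_line(prices):
    """Least-squares fit of price = slope * t + intercept for t = 0..n-1."""
    n = len(prices)
    if n < 2:
        raise ValueError("need at least two prices to fit a line")
    mean_t = (n - 1) / 2
    mean_p = sum(prices) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in enumerate(prices))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    intercept = mean_p - slope * mean_t
    return slope, intercept


def predict_next(prices):
    """Predict the price at the next time step (a continuous value)."""
    slope, intercept = fit_line(prices)
    return slope * len(prices) + intercept


if __name__ == "__main__":
    history = [10.0, 11.0, 12.0, 13.0]  # invented sample data
    print(predict_next(history))
```

A classifier, by contrast, would only answer a categorical question such as "up or down".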
NEW QUESTION # 20
By default, which of the following windowing behaviors does Dataflow apply to unbounded data sets?
- A. Windows at every 1 minute
- B. Windows at every 100 MB of data
- C. Single, Global Window
- D. Windows at every 10 minutes
Answer: C
Explanation:
Dataflow's default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections.
Reference: https://cloud.google.com/dataflow/model/pcollection
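The difference between the default global window and an explicit fixed window can be illustrated in plain Python. This is a toy model, not the Dataflow or Apache Beam API: events are assumed to be `(timestamp_seconds, value)` pairs, and the function names are made up for the example.

```python
from collections import defaultdict

GLOBAL_WINDOW = "GlobalWindow"


def assign_global(events):
    """Default behavior: every element lands in the same single window."""
    windows = defaultdict(list)
    for _timestamp, value in events:
        windows[GLOBAL_WINDOW].append(value)
    return dict(windows)


def assign_fixed(events, size_seconds):
    """Fixed windows: key each element by the start time of its window."""
    windows = defaultdict(list)
    for timestamp, value in events:
        start = timestamp - (timestamp % size_seconds)
        windows[start].append(value)
    return dict(windows)


if __name__ == "__main__":
    events = [(5, "a"), (59, "b"), (61, "c"), (130, "d")]
    print(assign_global(events))
    print(assign_fixed(events, 60))
```

With an unbounded source the global window never closes on its own, which is why streaming pipelines that aggregate usually set a non-global window or a trigger.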
NEW QUESTION # 21
When creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required: project, region, name, and ____.
- A. node
- B. type
- C. zone
- D. label
Answer: C
Explanation:
At a minimum, you must specify four values when creating a new cluster with the projects.regions.clusters.create operation:
The project in which the cluster will be created
The region to use
The name of the cluster
The zone in which the cluster will be created
You can specify many more details beyond these minimum requirements. For example, you can also specify the number of workers, whether preemptible compute should be used, and the network settings.
Reference:
https://cloud.google.com/dataproc/docs/tutorials/python-library-example#create_a_new_cloud_dataproc_cluste
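The four minimum values can be sketched as a small helper that assembles the arguments for a `projects.regions.clusters.create` call. This is an assumption-laden illustration: the helper name and the sample project, region, and zone are hypothetical, no API call is made, and the body layout (`projectId`, `clusterName`, `config.gceClusterConfig.zoneUri`) follows the Dataproc v1 REST resource as best understood here.

```python
def build_cluster_request(project, region, name, zone):
    """Assemble arguments for clusters.create from the four required values."""
    required = (("project", project), ("region", region),
                ("name", name), ("zone", zone))
    for label, value in required:
        if not value:
            raise ValueError(f"{label} is required")
    # The zone is passed as a Compute Engine zone URI inside the cluster config.
    zone_uri = (
        "https://www.googleapis.com/compute/v1/"
        f"projects/{project}/zones/{zone}"
    )
    return {
        "projectId": project,
        "region": region,
        "body": {
            "projectId": project,
            "clusterName": name,
            "config": {"gceClusterConfig": {"zoneUri": zone_uri}},
        },
    }


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(build_cluster_request("my-project", "us-central1",
                                "demo-cluster", "us-central1-a"))
```

Optional settings such as the worker count or preemptible workers would be added under `config` in the same body.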
NEW QUESTION # 22
A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time.
This is then loaded into BigQuery. Analysts in your company want to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, the query processing time has increased. You need to implement a change that would improve query performance in BigQuery. What should you do?
- A. Tier older data onto Cloud Storage files, and leverage extended tables.
- B. Implement clustering in BigQuery on the package-tracking ID column.
- C. Implement clustering in BigQuery on the ingest date column.
- D. Re-create the table using data partitioning on the package delivery date.
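Answer: B
Explanation:
The table is already partitioned by ingestion date, so clustering on the package-tracking ID lets BigQuery skip blocks that do not contain the requested package.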
NEW QUESTION # 23
You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud. You want to support transactions that scale horizontally. You also want to optimize data for range queries on nonkey columns. What should you do?
- A. Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.
- B. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
- C. Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.
- D. Use Cloud SQL for storage. Add secondary indexes to support query patterns.
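Answer: B
Explanation:
Cloud Spanner provides transactions that scale horizontally, and secondary indexes speed up range queries on non-key columns.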
NEW QUESTION # 24
Case Study: 1 - Flowlogistic
Company Overview
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.
Company Background
The company started as a regional trucking company, and then expanded into other logistics markets.
Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.
Solution Concept
Flowlogistic wants to implement two concepts using the cloud:
Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.
Existing Technical Environment
Flowlogistic architecture resides in a single data center:
Databases
8 physical servers in 2 clusters
SQL Server - user data, inventory, static data
3 physical servers
Cassandra - metadata, tracking messages
10 Kafka servers - tracking message aggregation and batch insert
Application servers - customer front end, middleware for order/customs
60 virtual machines across 20 physical servers
Tomcat - Java services
Nginx - static content
Batch servers
Storage appliances
iSCSI for virtual machine (VM) hosts
Fibre Channel storage area network (FC SAN) - SQL server storage
Network-attached storage (NAS) - image storage, logs, backups
Apache Hadoop/Spark servers
Core Data Lake
Data analysis workloads
20 miscellaneous servers
Jenkins, monitoring, bastion hosts,
Business Requirements
Build a reliable and reproducible environment with scaled parity of production.
Aggregate data in a centralized Data Lake for analysis
Use historical data to perform predictive analytics on future shipments
Accurately track every shipment worldwide using proprietary technology
Improve business agility and speed of innovation through rapid provisioning of new resources
Analyze and optimize architecture for performance in the cloud
Migrate fully to the cloud if all other requirements are met
Technical Requirements
Handle both streaming and batch data
Migrate existing Hadoop workloads
Ensure architecture is scalable and elastic to meet the changing demands of the company.
Use managed services whenever possible
Encrypt data in flight and at rest
Connect a VPN between the production data center and cloud environment
CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they are shipping.
CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.
CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability.
Additionally, I don't want to commit capital to building out a server environment.
Flowlogistic's CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they've purchased a visualization tool to simplify the creation of BigQuery reports. However, they've been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?
- A. Export the data into a Google Sheet for visualization.
- B. Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.
- C. Create a view on the table to present to the visualization tool.
- D. Create an additional table with only the necessary columns.
Answer: C
NEW QUESTION # 25
You have a data pipeline with a Dataflow job that aggregates and writes time series metrics to Bigtable. You notice that data is slow to update in Bigtable. This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data. What should you do?
Choose 2 answers
- A. Modify your Dataflow pipeline to use the Flatten transform before writing to Bigtable.
- B. Modify your Dataflow pipeline to use the CoGroupByKey transform before writing to Bigtable.
- C. Configure your Dataflow pipeline to use local execution.
- D. Increase the number of nodes in the Bigtable cluster.
- E. Increase the maximum number of Dataflow workers by setting maxNumWorkers in PipelineOptions.
Answer: D,E
Explanation:
https://cloud.google.com/bigtable/docs/performance#performance-write-throughput
https://cloud.google.com/dataflow/docs/reference/pipeline-options
NEW QUESTION # 26
Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?
- A. Store the common data in BigQuery and expose authorized views.
- B. Store the common data in BigQuery as partitioned tables.
- C. Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.
- D. Store the common data encoded as Avro in Google Cloud Storage.
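Answer: D
Explanation:
Avro files in Cloud Storage can be read by Hadoop and Spark jobs on Dataproc through the Cloud Storage connector, and BigQuery can load or query the same files.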
NEW QUESTION # 27
Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost. What should they do?
- A. Redefine the schema by evenly distributing reads and writes across the row space of the table.
- B. The performance issue should be resolved over time as the size of the Bigtable cluster is increased.
- C. Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.
- D. Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.
Answer: A
Explanation:
https://cloud.google.com/bigtable/docs/performance#troubleshooting
If you find that you're reading and writing only a small number of rows, you might need to redesign your schema so that reads and writes are more evenly distributed.
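Why sequential row keys concentrate load can be shown with a toy model. The sketch below assumes a "key range" is simply every key sharing the same first character, a crude stand-in for how a sorted key space is split into contiguous ranges; the function names are invented. Prefixing each key with a short hash is used only as one illustration of spreading writes, and it gives up efficient range scans, so it is not presented as the recommended design.

```python
import hashlib
from collections import Counter


def key_range(row_key):
    """Toy stand-in for a contiguous key range: group by first character."""
    return row_key[0]


def sequential_keys(count, start=1000):
    """Zero-padded, monotonically increasing IDs (a hotspot-prone design)."""
    return [f"{i:010d}" for i in range(start, start + count)]


def hashed_prefix_keys(count, start=1000):
    """Same IDs, each prefixed with two hex digits of its SHA-256 hash."""
    keys = []
    for i in range(start, start + count):
        raw = f"{i:010d}"
        prefix = hashlib.sha256(raw.encode()).hexdigest()[:2]
        keys.append(f"{prefix}#{raw}")
    return keys


def load_per_range(keys):
    """Count how many writes land in each toy key range."""
    return Counter(key_range(k) for k in keys)


if __name__ == "__main__":
    print(load_per_range(sequential_keys(1000)))
    print(load_per_range(hashed_prefix_keys(1000)))
```

In the sequential case every write lands in one range, which matches the symptom described above of reading and writing only a small number of rows.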
NEW QUESTION # 28
A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time.
This is then loaded into BigQuery. Analysts in your company want to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, the query processing time has increased. You need to implement a change that would improve query performance in BigQuery. What should you do?
- A. Tier older data onto Cloud Storage files, and leverage extended tables.
- B. Implement clustering in BigQuery on the package-tracking ID column.
- C. Re-create the table using data partitioning on the package delivery date.
- D. Implement clustering in BigQuery on the ingest date column.
Answer: B
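The effect of clustering on the package-tracking ID can be approximated with a toy block-pruning model. This is not how BigQuery is implemented; it only assumes that data is stored in fixed-size blocks and that a block can be skipped when the requested ID falls outside the block's minimum and maximum. The data and function names are invented.

```python
def make_blocks(tracking_ids, block_size, clustered):
    """Split rows into storage blocks, sorting first when clustered."""
    rows = sorted(tracking_ids) if clustered else list(tracking_ids)
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def blocks_scanned(blocks, wanted_id):
    """Count blocks whose min/max range could contain the wanted ID."""
    return sum(1 for block in blocks if min(block) <= wanted_id <= max(block))


if __name__ == "__main__":
    # 100 events for 10 packages, arriving interleaved (invented data).
    arrivals = [i % 10 for i in range(100)]
    print(blocks_scanned(make_blocks(arrivals, 10, clustered=False), 3))
    print(blocks_scanned(make_blocks(arrivals, 10, clustered=True), 3))
```

In this model the unclustered layout forces every block to be read for a single package, while the clustered layout narrows the scan to one block.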
NEW QUESTION # 29
You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit. Which solution should you choose?
- A. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.
- B. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.
- C. Use the Cloud Vision API to detect for damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.
- D. Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.
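Answer: B
Explanation:
Packages must be flagged in real time, which rules out batch analysis in BigQuery. An AutoML model trained on the company's own images and exposed through an API can score images as they arrive.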
NEW QUESTION # 30
You are developing an application on Google Cloud that will automatically generate subject labels for users' blog posts. You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning. What should you do?
- A. Call the Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.
- B. Build and train a text classification model using TensorFlow. Deploy the model using Cloud Machine Learning Engine. Call the model from your application and process the results as labels.
- C. Build and train a text classification model using TensorFlow. Deploy the model using a Kubernetes Engine cluster. Call the model from your application and process the results as labels.
- D. Call the Cloud Natural Language API from your application. Process the generated Sentiment Analysis as labels.
Answer: A
NEW QUESTION # 31
You need to choose a database to store time series CPU and memory usage for millions of computers. You need to store this data in one-second interval samples. Analysts will be performing real-time, ad hoc analytics against the database. You want to avoid being charged for every query executed and ensure that the schema design will allow for future growth of the dataset. Which database and data model should you choose?
- A. Create a table in BigQuery, and append the new samples for CPU and memory to the table
- B. Create a wide table in BigQuery, create a column for the sample value at each second, and update the row with the interval for each second
- C. Create a wide table in Cloud Bigtable with a row key that combines the computer identifier with the sample time at each minute, and combine the values for each second as column data.
- D. Create a narrow table in Cloud Bigtable with a row key that combines the Compute Engine computer identifier with the sample time at each second
Answer: C
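The data model in option C can be sketched as two small helpers. The key format, separator, and metric names below are assumptions made for illustration and do not come from the Bigtable client library: each row covers one computer for one minute, and each second's sample becomes a column within that row.

```python
from datetime import datetime, timezone


def row_key(computer_id, ts):
    """Row key: computer identifier plus the sample time truncated to the minute."""
    minute = ts.replace(second=0, microsecond=0)
    return f"{computer_id}#{minute.strftime('%Y%m%d%H%M')}"


def column_qualifier(metric, ts):
    """Column qualifier: metric name plus the second within the minute."""
    return f"{metric}:{ts.second:02d}"


if __name__ == "__main__":
    sample_time = datetime(2024, 5, 1, 12, 30, 7, tzinfo=timezone.utc)
    print(row_key("vm-42", sample_time))
    print(column_qualifier("cpu", sample_time))
```

Under this layout one row holds up to 60 samples per metric, so a per-minute read for a machine touches a single row.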
NEW QUESTION # 32
You are implementing a chatbot to help an online retailer streamline their customer service. The chatbot must be able to respond to both text and voice inquiries. You are looking for a low-code or no-code option, and you want to be able to easily train the chatbot to provide answers to keywords. What should you do?
- A. Use Dialogflow to implement the chatbot, defining the intents based on the most common queries collected.
- B. Use the Speech-to-Text API to build a Python application in a Compute Engine instance.
- C. Use the Speech-to-Text API to build a Python application in App Engine.
- D. Use Dialogflow for simple queries and the Speech-to-Text API for complex queries.
Answer: A
Explanation:
Dialogflow is a conversational AI platform that allows for easy implementation of chatbots without needing to code. It has built-in integration for both text and voice input via APIs like Cloud Speech-to-Text. Defining intents and entity types allows you to map common queries and keywords to responses. This would provide a low/no-code way to quickly build and iteratively improve the chatbot capabilities.
https://cloud.google.com/dialogflow/docs
Dialogflow is a natural language understanding platform that makes it easy to design and integrate a conversational user interface into your mobile app, web application, device, bot, interactive voice response system, and so on. Using Dialogflow, you can provide new and engaging ways for users to interact with your product. Dialogflow can analyze multiple types of input from your customers, including text or audio inputs (like from a phone or voice recording). It can also respond to your customers in a couple of ways, either through text or with synthetic speech.
NEW QUESTION # 33
You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200-400 MB each). You see some degradation in performance after the migration to Dataproc, so you'd like to optimize for it. You need to keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with 2 non-preemptible workers only) for this workload.
What should you do?
- A. Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size.
- B. Increase the size of your Parquet files so that each is at least 1 GB.
- C. Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job and copy results back to GCS.
- D. Switch to TFRecords formats (appr. 200MB per file) instead of parquet files.
Answer: C
NEW QUESTION # 34
To run a TensorFlow training job on your own computer using Cloud Machine Learning Engine, what would your command start with?
- A. gcloud ml-engine jobs submit training
- B. gcloud ml-engine jobs submit training local
- C. You can't run a TensorFlow program on your own computer using Cloud ML Engine.
- D. gcloud ml-engine local train
Answer: D
Explanation:
gcloud ml-engine local train - run a Cloud ML Engine training job locally.
This command runs the specified module in an environment similar to that of a live Cloud ML Engine Training Job.
This is especially useful in the case of testing distributed models, as it allows you to validate that you are properly interacting with the Cloud ML Engine cluster configuration.
Reference: https://cloud.google.com/sdk/gcloud/reference/ml-engine/local/train
NEW QUESTION # 35
Your analytics team wants to build a simple statistical model to determine which customers are most likely to work with your company again, based on a few different metrics. They want to run the model on Apache Spark, using data housed in Google Cloud Storage, and you have recommended using Google Cloud Dataproc to execute this job. Testing has shown that this workload can run in approximately 30 minutes on a 15-node cluster, outputting the results into Google BigQuery. The plan is to run this workload weekly.
How should you optimize the cluster for cost?
- A. Migrate the workload to Google Cloud Dataflow
- B. Use SSDs on the worker nodes so that the job can run faster
- C. Use a higher-memory node so that the job runs faster
- D. Use pre-emptible virtual machines (VMs) for the cluster
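Answer: D
Explanation:
The job is a short weekly batch run, so preemptible VMs lower the cluster cost without rewriting the Spark workload.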
NEW QUESTION # 36
You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?
- A. Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.
- B. Load the data every 30 minutes into a new partitioned table in BigQuery.
- C. Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore
- D. Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery
Answer: B
NEW QUESTION # 37
......
To become a Google Certified Professional Data Engineer, candidates need to pass the Professional-Data-Engineer exam. The exam consists of multiple-choice and multiple-select questions, and it is intended for professionals who have a minimum of three years of industry experience in data engineering. The exam duration is 2 hours, and the passing score is 70%. The exam fee is $200, and it can be taken at a testing center or remotely online.

