I took the Google Cloud Professional Data Engineer exam in 2020 and I would like to share my tips and all the resources I used to prepare. In this guide, I will give you an idea of what the exam is like and how I prepared for it (and passed!).
I spent about two months casually watching a couple of videos a day and then an intensive one month of practice papers and nailing the key points. I took the exam in early Oct 2020.
Course: Data Engineering with GCP Professional Certificate
Cost: £38/month (1-week free trial)
This Coursera specialization was the first online course I took and is taught by Google employees. It offers a combination of presentations, hands-on labs and demos. I found this course quite advanced for someone without any prior commercial experience. I wasn’t aware of current technologies such as those in the Hadoop ecosystem and was overwhelmed by the unfamiliar terminology.
Course: Google Cloud Professional Data Engineer Exam Questions
Cost: $10.99 (lifetime access; the price depends on Udemy sales)
This course includes realistic exam questions and is very useful for practising and checking that you are ready to pass. I spent a ton of time on its practice exams. Highly recommended — the best set of practice questions I found on Udemy.
Course: Google Cloud Certified Professional Data Engineer
Cost: $49/month (1 week free trial), $80/3 months (student subscription)
This course gives a high-level overview of each Google Cloud service and covers key concepts as well as Google’s best practices for using each one. The course was structured well, starting from the foundational concepts and moving through the different types of databases, architecting pipelines, machine learning and data visualization. I found this one very easy to follow and kind to the novice. Matthew’s expertise on the Google Cloud Platform meant his explanations were very clear and concise. In his videos, he often highlights key facts and concepts that you can expect to come up in the exam.
This course is Google’s fast-paced, practical introduction to machine learning. I used it as a quick refresher since I had already covered some of the popular algorithms and concepts in a university course. However, I think it’s well structured and will give you a good foundation in machine learning.
This course is a comprehensive introduction to the world of Big Data. It covers the principles of Big Data infrastructures and their integration with cloud computing.
The most useful thing about this course for me was that it provided a high-level overview of the most popular Big Data technologies, including core Hadoop, the Hadoop ecosystem (Hive, Pig, Kafka, etc.) and Apache Spark. For the exam, you will be expected to know what each of these does and its GCP equivalent, e.g. Kafka → Pub/Sub, Hive → BigQuery.
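As a quick reference, the rough correspondences can be captured in a small lookup table. This is my own study summary, not an official Google mapping, so treat each pairing as an approximation:

```python
# Rough open-source -> GCP equivalents worth knowing for the exam.
# This is a personal study summary, not an official mapping.
OSS_TO_GCP = {
    "Hadoop/Spark": "Cloud Dataproc",  # managed Hadoop/Spark clusters
    "Kafka": "Cloud Pub/Sub",          # messaging / streaming ingestion
    "Hive": "BigQuery",                # SQL-on-big-data warehouse
    "HBase": "Cloud Bigtable",         # wide-column NoSQL store
    "HDFS": "Cloud Storage",           # durable object/file storage
    "Apache Beam": "Cloud Dataflow",   # batch + streaming pipelines
}

def gcp_equivalent(tech: str) -> str:
    """Return the closest managed GCP service for an open-source technology."""
    return OSS_TO_GCP.get(tech, "no direct managed equivalent")

print(gcp_equivalent("Kafka"))  # Cloud Pub/Sub
```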
Qwiklabs is a platform that provides temporary credits to use on various cloud platforms in the format of tutorials and demos. You can use Qwiklabs to get extra hands-on experience with the Google Cloud Platform. I took the Baseline: Data, ML, AI / Data Engineering / BigQuery for Data Warehousing courses. I found the last one quite useful as it allows you to practise writing query statements in BigQuery.
Read the exam syllabus first
What the exam is testing is your ability to come up with a feasible technical solution to a business requirement. Although there will never be one perfect solution, a great solution should take the four points mentioned above into consideration. In the exam, you can expect to see several options that may meet the business or technical requirement, but one will be better than the rest. The first mistake I made was not reading the official Google exam guide properly and naively assuming that the online courses would cover it. What the online courses do well is cover each GCP service in detail: its common use cases, Google’s best practices for using it and how it fits into the overall data engineering role.
Cost vs Performance
The hardest part of the exam for me was the trade-off between cost and performance. (This is probably where your 3+ years of industry experience might come in handy!) The business people think in terms of minimising costs and the technical people think in terms of increasing performance. A common question is phrased along the lines of ‘A mid-size company wants to [do something] whilst keeping costs low’. While some situations are obvious, such as choosing between Cloud SQL and Cloud Spanner, scenarios that involve running jobs are harder: you could use more powerful CPUs (which cost more per hour) to run a job, resulting in a shorter runtime (which costs less overall).
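To see why these questions feel ambiguous, it helps to write the cost out as arithmetic. The sketch below uses a made-up per-vCPU-hour rate (real Compute Engine pricing differs): if a job scales perfectly, quadrupling the vCPUs while quartering the runtime leaves the cost unchanged.

```python
# Toy cost model for the CPU-vs-runtime trade-off. The rate is made up
# for illustration; real Compute Engine pricing is different.
def job_cost(vcpus: int, hours: float, rate_per_vcpu_hour: float = 0.03) -> float:
    """Total cost = vCPUs x runtime hours x per-vCPU-hour rate."""
    return vcpus * hours * rate_per_vcpu_hour

small = job_cost(vcpus=4, hours=10.0)   # smaller machine, longer runtime
big = job_cost(vcpus=16, hours=2.5)     # 4x the vCPUs, 4x faster (perfect scaling)

print(small, big)  # 1.2 1.2 -> identical cost under perfect scaling
```

In practice jobs rarely scale perfectly, so the bigger machine usually costs more per unit of work even though it finishes sooner — which is exactly the judgement call the exam is probing.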
Some other areas you should look at
Cloud Key Management & Data Encryption — Data security is a very important component of data engineering. By default, GCP encrypts all customer data at rest. You should be aware that there are other encryption methods including CMEK, CSEK and Client-Side Encryption. Google Cloud has a Cloud Key Management Service that lets you manage cryptographic keys for your cloud services.
Kafka — I read from other sources that Kafka appears a lot in the exam and it was true in my case as well. The Google Cloud equivalent is Cloud Pub/Sub. You should be aware of the differences between the two such as Pub/Sub only holds data for up to seven days whereas Kafka can store as much data as you want and can be accessed anytime.
BigQuery ML — Recently, there’s a trend to make AI accessible to everyone. BigQuery ML allows users to create and execute machine learning models in BigQuery using standard SQL queries as opposed to writing code. One of the big advantages is that the data does not leave BigQuery. Check out this video for a simple explanation.
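To make the "standard SQL, no code" point concrete, the sketch below assembles a BigQuery ML `CREATE MODEL` statement as a string. The dataset, table and model names are hypothetical, and this only builds the SQL — running it would require a BigQuery client and project:

```python
# Sketch of a BigQuery ML statement: the model is defined entirely in
# standard SQL, so the training data never leaves BigQuery.
# All dataset/model/table names below are hypothetical examples.
def create_model_sql(model: str, model_type: str, label: str, source_table: str) -> str:
    """Build a CREATE MODEL statement for BigQuery ML."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS(model_type='{model_type}', input_label_cols=['{label}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )

sql = create_model_sql(
    model="mydataset.churn_model",
    model_type="logistic_reg",
    label="churned",
    source_table="mydataset.customer_features",
)
print(sql)
```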
Failover replica — As stated in section 4.3 of the exam guide, it is important to think about a backup solution for your data infrastructure. Some Google Cloud services create copies of the data automatically, whereas others have to be set up manually. Think about which regions and zones you want to provision services in. What happens if a data centre in a zone goes down? How do you prepare for this?
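The "zone goes down" question can be reduced to a simple preference order: fail over to a replica in another zone of the same region first, and only then to another region. The sketch below encodes that rule; the zone names follow the real GCP naming pattern but the deployment itself is hypothetical:

```python
from typing import List, Optional

# Toy failover selection: prefer a replica in a different zone of the same
# region, then fall back to another region. Deployment layout is hypothetical.
def pick_failover(primary: str, replicas: List[str]) -> Optional[str]:
    region = primary.rsplit("-", 1)[0]  # "europe-west2-a" -> "europe-west2"
    same_region = [z for z in replicas
                   if z != primary and z.startswith(region + "-")]
    other_region = [z for z in replicas if not z.startswith(region + "-")]
    return (same_region or other_region or [None])[0]

zones = ["europe-west2-a", "europe-west2-b", "us-east1-b"]
print(pick_failover("europe-west2-a", zones))  # europe-west2-b
```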
If you don’t have a lot of experience using the Google Cloud Platform, I recommend you practise using the hands-on labs and Qwiklabs. You shouldn’t memorise how to do each task, but use it as an opportunity to familiarise yourself with the service and the overall GCP environment. If you get the Linux Academy subscription, you get access to the cloud sandbox where you can receive guest user credentials to use GCP for 3 hours per session. You can still gain valuable practice without breaking your wallet by using this cloud sandbox or by creating a free GCP account ($300 free credits) and following the task instructions for the hands-on labs on Coursera or Qwiklabs. (On Qwiklabs, you are essentially paying to use the GCP environment for a fixed time)
The course instructors strongly stress that doing the online courses alone is not enough to pass the exam. Your preparation should include a variety of online resources, hands-on practice with GCP and past exam questions. There are two more things that I also included in my revision and highly recommend.
- Go through the case studies — They used to be a part of the exam before March 2019 but are not in the current syllabus. Having said that, I found it very useful to go through these case studies. As a student without any prior data engineering experience, it was helpful to see the real-life applications and the thought process data engineers go through to come up with a technical solution to a business requirement. You can view and go through these case studies on Preparing for the Google Cloud Professional Data Engineer Exam course on Coursera. You can watch a group of experienced professionals discuss a few potential solutions to one of the case studies here.
- Read the official Google Cloud Documentation — It’s unrealistic to read the whole documentation, but my recommendation is to read the documentation for the topics covered in the online videos. The points you should remember are Google’s best practices (for doing certain things) and the quotas and limits for some of the services (e.g. the maximum storage for a Cloud SQL instance). The rule of thumb is that if you are likely to look it up on the job (e.g. how much a 2-vCPU Compute Engine instance costs per month), you don’t need to memorise it. However, you should roughly know the upper limits for the different storage types.
During the exam
The exam consists of 50 questions and you have 2 hours to finish it. There is a bookmark feature where you can flag questions for later review. From what I’ve read online, most people took about 1 hour to 1 hour 15 minutes to complete the test. I used the full 2 hours. I would say that the exam is about 20% harder than the practice exams. Many of the questions can inflict a lot of self-doubt, such as those involving the cost vs performance trade-off. My tip is to pace yourself and not waste too much time on a single question. If you don’t know the answer or are unsure about your answer, bookmark it and move on to the next. Don’t panic if you can’t answer all the questions the first time. I got stumped on the first 5 questions and then bookmarked more than half of the questions for later review.
Well done on making it this far! I hope this guide has been helpful and hopefully a confidence booster for those taking the exam soon. It is definitely possible to pass this exam without commercial experience and to prepare for it from a theoretical standpoint. I’ve added a few links below that helped me in my preparation for the exam. These individuals also share their experiences and provide some great advice and areas to look out for.