CCA175 (2021) - Hadoop and Spark Developer: How to study?

Manas Yadav
Jun 20, 2021

I attempted the Cloudera CCA175 exam after preparing for about a month. I plan to split the details across multiple blog posts; in this one I list all the resources you need, so that your time goes into learning rather than into researching the certification.

1) Check what the exam covers and how it is conducted. Make sure the certification fits your goals and is worth the effort and money.

2) If it does, go to the page below and have a look at the topics

Skip old questions and forum discussions on CCA175 from before 2020. CCA175 is now about Spark and Spark only (no Hive, no Sqoop; it would be a surprise to get a question related to a “JDBC connection” or RDDs).

3) Register for a Udemy course if you get it at an affordable price. I am listing the ones that I used -

4) Get familiar with the Spark documentation (do not print it or export it as a PDF; go through the online site). There is no need to learn it by heart (if you can, all the better). At the very least, you must be able to navigate the Spark documentation well enough to find any topic you need to solve a problem. One glance at the problem should be enough to identify the keyword to search for.

5) Go through the multiple blogs (listing down a few I could find) -

Once you have read the blogs, start collecting practical questions related to Spark (for example, how to load a file into HDFS and then save it as Parquet). Connect with me on LinkedIn if you need help with the questions.

6) Once you have, say, 3-4 question sets, here is what I did for practice -
a) First run: solve the questions using any help available (searching on Google, discussing with a friend, etc.). There are usually three ways to solve a problem in Spark: SQL, DataFrames and RDDs. I practiced all three, although you do not need to learn all of them. Practicing multiple ways did help me during the exam, though: one of my solutions did not work with DataFrames and I had to solve it using Spark SQL.
b) Second run: the same questions as above (plus one completely new question set), this time solved using the Spark documentation only (make sure you use the same link as provided on the Cloudera page). Repeat until you are comfortable completing the questions with the Spark documentation alone.
c) Third run: the same questions, practiced without referring to the Spark documentation. I then created a new question set and attempted it with only the Spark documentation for help. By this point I was able to complete a test (8-10 questions) within 60 minutes.

7) Where to practice?
>> Deploy Spark on your local Windows machine
>> Try a free account on the cloud (AWS/GCP) for practice
>> CDP VM from Cloudera

8) What to study -
Spark documentation
Go through any of the Udemy courses

9) Where to focus -
>> Writing and reading multiple file formats — Parquet, CSV, Avro
>> Compression — Snappy, Gzip
>> Save as a table
>> Joins, Subqueries
>> Aggregation and Analytical functions
>> Repartition & Coalesce
>> Date conversion as well as String functions

Do have a look at the page. Most of the tips and tricks on it are good for any Cloudera exam.