Cloudera CCA159 — CCA Data Analyst Exam
Cloudera is a pioneer in the field of data analysis, data engineering and data science, and a Cloudera certification goes a long way in validating your skills in the big data world.
Unlike multiple-choice certifications, all Cloudera exams are hands-on: you work in a live environment on realistic scenario problems.
Certification page link — Cloudera Certified Associate Data Analyst
Udemy Learning Link — CCA 159 — Data Analyst using Sqoop, Hive and Impala | Udemy
Hopefully you have already decided to plunge into the world of SQL and data analysis and are interested in the exam. In this post I will share tips & tricks that will help you during the exam.
Before the exam -
1) If you are using more than one monitor, ensure your display setting is "Duplicate" or that only one screen is active. The exam proctor will not allow two monitors in Extended mode.
2) Use the largest monitor you have, as the font size is small. I have a 24-inch monitor and still had to struggle to read the fonts.
3) Keep practicing with the Plume editor (which will be available during the exam)
4) Remove all extra toolbars on your browser. Learn how to use the Full Screen mode in the browser
5) Practice completing 10–15 questions within an hour and a half. Keep in mind that you may get a single-node machine, and all your operations will run on huge datasets and take time.
During the exam -
1) Increase the font size. There is an icon on the desktop which can be used to change the screen resolution. — 1 Minute
2) Open the Plume editor, then create and save as many tabs as there are questions, plus an extra tab to track the status of each question — 2 Minutes
3) Copy all questions to the plume editor (one question in each tab) — 2 Minutes
4) Read through all the questions — 5 Minutes
5) Open two terminals (look at the very top of the screen for the terminal icon) — one for file operations and another for executing Spark/hive/impala. You also have a Hue interface available.
6) It is most likely a single-node machine, so avoid overloading it. I ran two commands in parallel and both failed.
7) While exploring data with SELECT, always use a LIMIT clause and keep the value at 5. While verifying results, if you use LIMIT, head or tail, use a value of 50. That way, if you accidentally leave the LIMIT clause in your report or CREATE TABLE statement and end up with only 5 rows, the verification step will expose it.
8) Complete questions — 1 hour 20 mins
9) Verification — 30 minutes
>> Verify the count in the tables
>> Verify the count in the created files
>> Verify the data, column names, column position and the record separators
10) If you are one of those who do not like unnecessary logs being printed in beeline, use the below -
set hive.server2.logging.operation.enabled=false;
11) If you need to copy and paste the column names, use desc <<tablename>>, or if you only want to list the column names in beeline, use -
show columns in products;
This would help you quickly copy the column names and create a new table.
12) Prefer using Impala or Impala shell (instead of hive/beeline) during the test to speed up execution if possible.
13) Take care to not drop tables during the exam. Also learn how to recover a dropped table in hive (just in case you drop one).
14) Be really careful with schema names, table names, column names and the order of columns.
15) Learn how to execute common hdfs commands from the beeline (or hive) command line, as sketched below
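For example, the dfs command runs as a regular HiveQL statement in the hive shell, and usually in beeline as well (whether beeline allows it can depend on the cluster's authorization settings). A minimal sketch, with illustrative paths and a typical output file name:
dfs -ls /user/<<UserName>>/data/region/;
dfs -du -h /user/<<UserName>>/data/region/;
dfs -cat /user/<<UserName>>/data/region/000000_0;
This also helps with the file verification in step 9 without switching to the second terminal.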
Topics to Focus Upon -
1) Joins
2) Window Functions (see the sketch after this list)
3) Partitioning
4) Views
5) Export table data / Write to a directory / Create a report
6) How to run queries from a file
hive -f queries.hql (beeline also supports -f)
7) Change column names
8) String Concatenation
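Joins, window functions and views do not show up in the snippets later in this post, so here is a minimal sketch that touches all three. It assumes retail-style orders and order_items tables (the table and column names are assumptions; use whatever the question gives you):
CREATE VIEW order_revenue_ranked AS
SELECT o.order_id,
       o.order_date,
       SUM(oi.order_item_subtotal) AS revenue,
       RANK() OVER (PARTITION BY o.order_date ORDER BY SUM(oi.order_item_subtotal) DESC) AS rnk
FROM orders o
JOIN order_items oi ON oi.order_item_order_id = o.order_id
GROUP BY o.order_id, o.order_date;
SELECT * FROM order_revenue_ranked WHERE rnk <= 3 LIMIT 5;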
Some technical stuff you should know -
- Dates
A) Convert string to date:
select to_date(from_unixtime(unix_timestamp(order_date),'yyyy-MM-dd hh:mm:ss')) from orders limit 10;
B) Add days to date
select date_add(to_date(from_unixtime(unix_timestamp(order_date),'yyyy-MM-dd hh:mm:ss')),1) from orders limit 10;
C) Add time to date
select from_unixtime(unix_timestamp(order_date) + 3600,'yyyy-MM-dd hh:mm:ss') from orders limit 10;
- split: splits a line into an array
select split(line,' ') word from files;
- explode: explodes an array into rows
select explode(split(line,' ')) word from files;
- LATERAL VIEW
select r_regionkey, SUM(nations_details.n_nationkey), COUNT(nations_details.n_name) from region LATERAL VIEW explode(r_nations) exploded_table as nations_details GROUP BY r_regionkey;
SELECT tf.* FROM (SELECT 0) t LATERAL VIEW EXPLODE(array('A','B','C')) tf as col;
- concat
select concat('a','b','c');
- concat_ws
select concat_ws('|','a','b','c');
- collect_list
select deptno, collect_list(deptid), max(sal) from dept group by deptno;
+---------+--------+------+
| deptno  | _c1    | _c2  |
+---------+--------+------+
| 1       | [1,2]  | 11   |
+---------+--------+------+
- collect_set
Same as collect_list but removes the duplicates
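For example, against the same hypothetical dept table as above:
select deptno, collect_set(deptid), max(sal) from dept group by deptno;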
- Generating report file in hive
INSERT OVERWRITE DIRECTORY '/user/<<UserName>>/data/region/' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE SELECT * FROM region LIMIT 10;
- Table Stored as Avro
CREATE TABLE kst (field1 string, field2 int)
PARTITIONED BY (ds string)
STORED AS AVRO;
INSERT INTO TABLE kst PARTITION(ds='A') VALUES ('B',3);
create table kst_2 STORED AS AVRO as select * from kst where 1=2;
create table kst_3 STORED AS AVRO TBLPROPERTIES("avro.compress"="snappy") as select * from kst where 1=2;
- ALTER TABLE
alter table m change `_c1` cnt bigint;
- For transactional tables
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.support.concurrency=true;
- Recover deleted files in hadoop
hdfs dfs -ls /user/<<UserOrg>>/.Trash/Current/user/<<UserOrg>>/warehouse/test.db/dept1
Move the files back to the correct directory and load them; a rough sketch follows.
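A minimal recovery sketch, assuming the dropped table was a managed table named test.dept1, that you still remember its DDL, and that the warehouse and Trash paths match the ones above (the column list and delimiter are placeholders):
CREATE TABLE test.dept1 (deptno INT, deptid INT, sal INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
dfs -mv /user/<<UserOrg>>/.Trash/Current/user/<<UserOrg>>/warehouse/test.db/dept1/* /user/<<UserOrg>>/warehouse/test.db/dept1/;
SELECT COUNT(*) FROM test.dept1;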
Ping me Manas Yadav | LinkedIn if you need any help