This fast-paced, two-day course focuses on data analytics using the Python language, the Spark platform for highly scalable operations, and AWS Glue for comprehensive data access. Extensive hands-on exercises give students the practical experience they need to use these tools effectively.
Objectives
Audience
Developers, Software Engineers, Data Scientists, and IT Architects.
Chapter 1. Introduction to Apache Spark
Chapter 2. The Spark Shell
Chapter 3. Spark RDDs
Chapter 4. Introduction to Spark SQL
Chapter 5. Overview of Amazon Web Services (AWS)
Chapter 6. Introduction to AWS Glue
Chapter 7. Introduction to Apache Spark
Chapter 8. AWS Glue PySpark Extensions
Lab Exercises
Lab 1. Learning the Databricks Community Cloud Lab Environment
Lab 2. Data Visualization and EDA with pandas and seaborn
Lab 3. Correlating Cause and Effect
Lab 4. Learning the PySpark Shell Environment
Lab 5. Understanding Spark DataFrames
Lab 6. Learning the PySpark DataFrame API
Lab 7. Data Repair and Normalization in PySpark
Lab 8. Working with the Parquet File Format in PySpark and pandas
Lab 9. AWS Glue Overview
Lab 10. AWS Glue Crawlers and Classifiers
Lab 11. Creating an S3 Bucket for AWS Glue ETL Script Output
Lab 12. Creating and Working with Glue Scripts Using Dev Endpoints
Lab 13. Using the PySpark API Directly
Lab 14. Understanding AWS Glue ETL Jobs
Prerequisites
Participants must have practical experience coding in Python or another modern programming language. Knowledge of the AWS Management Console is desirable but not required. Students are expected to learn new material quickly and to reinforce each topic by completing programming exercises (labs).