Description

This fast-paced, two-day course focuses on data analytics using the Python language, the Spark platform for highly scalable operations, and AWS Glue for comprehensive data access. Extensive hands-on exercises ensure that students come away with the practical experience required to perform successfully.

Objectives

Understand the Spark platform and its architecture
Use the Spark Shell to create and run Spark applications
Work with Spark RDDs and Spark SQL DataFrames
Use AWS Glue to crawl, classify, and transform data
Build scalable and reliable data pipelines using Spark and AWS Glue

Audience

Developers, Software Engineers, Data Scientists, and IT Architects.

Upcoming Classes

Virtual Classroom Live
April 14, 2025

$1,600.00

2 days 10 AM ET - 6 PM ET

Virtual Classroom Live
May 26, 2025

$1,600.00

2 days 10 AM ET - 6 PM ET

Virtual Classroom Live
July 07, 2025

$1,600.00

2 days 10 AM ET - 6 PM ET

Private Training Available

If no date is scheduled, you don’t see a date that works for you, or you are looking for a private training event, please call 651-905-3729 or submit a request for further information here.

Course Overview

Chapter 1. Introduction to Apache Spark

What is Apache Spark
The Spark Platform
Spark vs Hadoop's MapReduce (MR)
Common Spark Use Cases
Languages Supported by Spark
Running Spark on a Cluster
The Spark Application Architecture
The Driver Process
The Executor and Worker Processes
Spark Shell
Jupyter Notebook Shell Environment
Spark Applications
The spark-submit Tool
The spark-submit Tool Configuration
Interfaces with Data Storage Systems
Project Tungsten
The Resilient Distributed Dataset (RDD)
Datasets and DataFrames
Spark SQL, DataFrames, and Catalyst Optimizer
Spark Machine Learning Library
GraphX
Extending Spark Environment with Custom Modules and Files
Summary

Chapter 2. The Spark Shell

The Spark Shell
The Spark v2+ Command-Line Shells
The Spark Shell UI
Spark Shell Options
Getting Help
Jupyter Notebook Shell Environment
Example of a Jupyter Notebook Web UI (Databricks Cloud)
The Spark Context (sc) and Spark Session (spark)
Creating a Spark Session Object in Spark Applications
The Shell Spark Context Object (sc)
The Shell Spark Session Object (spark)
Loading Files
Saving Files
Summary

Chapter 3. Spark RDDs

The Resilient Distributed Dataset (RDD)
Ways to Create an RDD
Supported Data Types
RDD Operations
RDDs are Immutable
Spark Actions
RDD Transformations
Other RDD Operations
Chaining RDD Operations
RDD Lineage
The Big Picture
What May Go Wrong
Miscellaneous Pair RDD Operations
RDD Caching
Summary

Chapter 4. Introduction to Spark SQL

What is Spark SQL?
Uniform Data Access with Spark SQL
Hive Integration
Hive Interface
Integration with BI Tools
What is a DataFrame?
Creating a DataFrame in PySpark
Commonly Used DataFrame Methods and Properties in PySpark
Grouping and Aggregation in PySpark
The "DataFrame to RDD" Bridge in PySpark
The SQLContext Object
Examples of Spark SQL / DataFrame (PySpark Example)
Converting an RDD to a DataFrame Example
Example of Reading / Writing a JSON File
Using JDBC Sources
JDBC Connection Example
Performance, Scalability, and Fault-tolerance of Spark SQL
Summary

Chapter 5. Overview of Amazon Web Services (AWS)

Amazon Web Services
The History of AWS
The Initial Iteration of Moving amazon.com to AWS
The AWS (Simplified) Service Stack
Accessing AWS
Direct Connect
Shared Responsibility Model
Trusted Advisor
The AWS Distributed Architecture
AWS Services
Managed vs Unmanaged Amazon Services
Amazon Resource Name (ARN)
Compute and Networking Services
Elastic Compute Cloud (EC2)
AWS Lambda
Auto Scaling
Elastic Load Balancing (ELB)
Virtual Private Cloud (VPC)
Route 53 Domain Name System
Elastic Beanstalk
Security and Identity Services
Identity and Access Management (IAM)
AWS Directory Service
AWS Certificate Manager
AWS Key Management Service (KMS)
Storage and Content Delivery
Elastic Block Store (EBS)
Simple Storage Service (S3)
Glacier
CloudFront Content Delivery Service
Database Services
Relational Database Service (RDS)
DynamoDB
Amazon ElastiCache
Redshift
Messaging Services
Simple Queue Service (SQS)
Simple Notification Service (SNS)
Simple Email Service (SES)
AWS Monitoring with CloudWatch
Other Services Example
Summary

Chapter 6. Introduction to AWS Glue

What is AWS Glue?
AWS Glue Components
AWS Glue Components (Cont'd)
Managing Notebooks
AWS Glue Components (Cont'd)
Putting it Together: The AWS Glue Environment Architecture
AWS Glue Main Activities
Additional Glue Services
When To Use AWS Glue?
Integration with other AWS Services
Summary

Chapter 7. Introduction to Apache Spark

What is Apache Spark
The Spark Platform
Uniform Data Access with Spark SQL
Common Spark Use Cases
Languages Supported by Spark
Running Spark on a Cluster
The Spark Application Architecture
The Driver Process
The Executor and Worker Processes
Spark Shell
Jupyter Notebook Shell Environment
Interfaces with Data Storage Systems
The Resilient Distributed Dataset (RDD)
Datasets and DataFrames
Data Partitioning
Data Partitioning Diagram
Summary

Chapter 8. AWS Glue PySpark Extensions

AWS Glue and Spark
The DynamicFrame Object
The DynamicFrame API
The GlueContext Object
Glue Transforms
A Sample Glue PySpark Script
Using PySpark
AWS Glue PySpark SDK
Summary

Lab Exercises

Lab 1. Learning the Databricks Community Cloud Lab Environment
Lab 2. Data Visualization and EDA with pandas and seaborn
Lab 3. Correlating Cause and Effect
Lab 4. Learning PySpark Shell Environment
Lab 5. Understanding Spark DataFrames
Lab 6. Learning the PySpark DataFrame API
Lab 7. Data Repair and Normalization in PySpark
Lab 8. Working with Parquet File Format in PySpark and pandas
Lab 9. AWS Glue Overview
Lab 10. AWS Glue Crawlers and Classifiers
Lab 11. Creating an S3 Bucket for AWS Glue ETL Script Output
Lab 12. Creating and Working with Glue Scripts Using Dev Endpoints
Lab 13. Using PySpark API Directly
Lab 14. Understanding AWS Glue ETL Jobs

Intermediate Data Engineering with Python

2 days
