651-905-3729 Microsoft Silver Learning Partner EC-Council Reseller CompTIA Authorized Partner

Data Engineering Bootcamp Training using Python and PySpark Virtual Classroom Live August 19, 2024

Price: $3,100

This course runs for 5 days.

The class will run daily from 10 AM ET to 6 PM ET.

Class Location: Virtual Live Classroom (instructor-led).

Enroll today to reserve your spot!

Description

This hands-on Data Engineering Bootcamp teaches attendees the foundations of data engineering using Python and Spark SQL. Students learn how to build production-ready data-driven solutions and gain a comprehensive understanding of data engineering.

Target Audience

  • Data Engineers

Skills Gained

  • Data Availability and Consistency
  • A/B Testing Data Engineering Tasks Project
  • Learning the Databricks Community Cloud Lab Environment
  • Python Variables
  • Dates and Times
  • The if, for, and try Constructs
  • Dictionaries
  • Sets, Tuples
  • Functions, Functional Programming
  • Understanding NumPy and pandas
  • PySpark

Course Overview

Big Data Concepts and Systems Overview for Data Engineers

  • Gartner's Definition of Big Data
  • The Big Data Confluence Diagram
  • A Practical Definition of Big Data
  • Challenges Posed by Big Data
  • The Traditional Client-Server Processing Pattern
  • Enter Distributed Computing
  • Data Physics
  • Data Locality (Distributed Computing Economics)
  • The CAP Theorem
  • Mechanisms to Guarantee a Single CAP Property
  • Eventual Consistency
  • NoSQL Systems CAP Triangle
  • Big Data Sharding
  • Sharding Example
  • Apache Hadoop
  • Hadoop Ecosystem Projects
  • Other Hadoop Ecosystem Projects
  • Hadoop Design Principles
  • Hadoop's Main Components
  • Hadoop Simple Definition
  • Hadoop Component Diagram
  • HDFS
  • Storing Raw Data in HDFS and Schema-on-Demand
  • MapReduce Defined
  • MapReduce Shared-Nothing Architecture
  • MapReduce Phases
  • The Map Phase
  • The Reduce Phase
  • Similarity with SQL Aggregation Operations
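The Map and Reduce phases listed above, and their similarity to SQL aggregation, can be sketched in plain Python. This is only an illustration of the idea under simplifying assumptions (a single process, in-memory data); it is not Hadoop's API, and the function names map_phase, shuffle, and reduce_phase are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values collected for one key into a single count."""
    return key, sum(values)


# Two made-up input records; in Hadoop these would be splits of a file in HDFS.
lines = ["big data big ideas", "data locality matters"]

mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The result plays the same role as SELECT word, COUNT(*) ... GROUP BY word in SQL: the map step picks the grouping key, and the reduce step computes the aggregate for each key.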

Defining Data Engineering

  • Data is King
  • Translating Data into Operational and Business Insights
  • What is Data Engineering
  • The Data-Related Roles
  • The Data Science Skill Sets
  • The Data Engineer Role
  • Core Skills and Competencies
  • An Example of a Data Product
  • What is Data Wrangling (Munging)?
  • The Data Exchange Interoperability Options

Data Processing Phases

  • Typical Data Processing Pipeline
  • Data Discovery Phase
  • Data Harvesting Phase
  • Data Priming Phase
  • Exploratory Data Analysis
  • Model Planning Phase
  • Model Building Phase
  • Communicating the Results
  • Production Roll-out
  • Data Logistics and Data Governance
  • Data Processing Workflow Engines
  • Apache Airflow
  • Data Lineage and Provenance
  • Apache NiFi

Python 3 Introduction

  • What is Python?
  • Python Documentation
  • Where Can I Use Python?
  • Which version of Python am I running?
  • Running Python Programs
  • Python Shell
  • Dev Tools and REPLs
  • IPython
  • Jupyter
  • The Anaconda Python Distribution

Python Variables and Types

  • Variables and Types
  • More on Variables
  • Assigning Multiple Values to Multiple Variables
  • More on Types
  • Variable Scopes
  • The Layout of Python Programs
  • Comments and Triple-Delimited String Literals
  • Sample Python Code
  • PEP8
  • Getting Help on Python Objects
  • Null (None)
  • Strings
  • Finding Index of a Substring
  • String Splitting
  • Raw String Literals
  • String Formatting and Interpolation
  • String Public Method Names
  • The Boolean Type
  • Boolean Operators
  • Relational Operators
  • Numbers
  • "Easy Numbers"
  • Looking Up the Runtime Type of a Variable
  • Divisions
  • Assignment-with-Operation
  • Dates and Times
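A short sketch of a few of the topics above (division, runtime types, string formatting, dates and times). The values are illustrative only; the start date and class hours are borrowed from this listing, and the variable names are assumptions made for the example.

```python
from datetime import datetime, timedelta

# True division always yields a float; floor division and modulo keep integers.
true_div = 7 / 2
floor_div = 7 // 2
remainder = 7 % 2

# Looking up the runtime type of a variable.
type_name = type(true_div).__name__

# Dates and times: a 5-day class that starts at 10 AM and ends at 6 PM on the last day.
start = datetime(2024, 8, 19, 10, 0)
end = start + timedelta(days=4, hours=8)

# String formatting with an f-string and a date format specification.
label = f"Class runs {start:%Y-%m-%d} to {end:%Y-%m-%d}"
print(true_div, floor_div, remainder, type_name)
print(label)
```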

Control Statements and Data Collections

  • Control Flow with The if-elif-else Triad
  • An if-elif-else Example
  • Conditional Expressions (a.k.a. Ternary Operator)
  • The While-Break-Continue Triad
  • The for Loop
  • The range() Function
  • Examples of Using range()
  • The try-except-finally Construct
  • The assert Expression
  • Lists
  • Main List Methods
  • List Comprehension
  • Zipping Lists
  • Enumerate
  • Dictionaries
  • Working with Dictionaries
  • Other Dictionary Methods
  • Sets
  • Set Methods
  • Set Operations
  • Set Operations Examples
  • Finding Unique Elements in a List
  • Common Collection Functions and Operators
  • Tuples
  • Unpacking Tuples
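Several of the collection topics above fit in one small sketch: zipping lists, list comprehensions, enumerate, set operations, and tuple unpacking. The data is made up for illustration.

```python
names = ["ann", "bob", "cy"]
scores = [82, 95, 78]

# zip pairs the lists element by element; dict() turns the pairs into a lookup table.
score_by_name = dict(zip(names, scores))

# List comprehension with a filter condition.
passed = [name for name, score in score_by_name.items() if score >= 80]

# enumerate numbers the items, here starting from 1 instead of 0.
by_score = sorted(score_by_name, key=score_by_name.get, reverse=True)
ranked = [f"{position}. {name}" for position, name in enumerate(by_score, start=1)]

# A set keeps only the unique elements of a list; & is set intersection.
unique_tags = set(["etl", "sql", "etl", "spark"])
common = unique_tags & {"sql", "spark", "python"}

# Tuple unpacking, with a starred name collecting the remaining items into a list.
first, *rest = tuple(names)

print(passed, ranked, sorted(common), first, rest)
```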

Functions and Modules

  • Built-in Functions
  • Functions
  • The "Call by Sharing" Parameter Passing
  • Global and Local Variable Scopes
  • Default Parameters
  • Named Parameters
  • Dealing with Arbitrary Number of Parameters
  • Keyword Function Parameters
  • What is Functional Programming (FP)?
  • Concept: Pure Functions
  • Concept: Recursion
  • Concept: Higher-Order Functions
  • Lambda Functions in Python
  • Examples of Using Lambdas
  • Lambdas in the Sorted Function
  • Python Modules
  • Importing Modules
  • Installing Modules
  • Listing Methods in a Module
  • Creating Your Own Modules
  • Creating a Module's Entry Point
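As a rough illustration of the function topics above, the sketch below shows arbitrary positional and keyword parameters, a higher-order function, a lambda used as a sort key, recursion, and a module entry point. All names (total, apply_twice, rate, records) are hypothetical and chosen only for this example.

```python
def total(*amounts, **options):
    """Sum any number of amounts; 'rate' is an optional keyword multiplier."""
    rate = options.get("rate", 1.0)
    return sum(amounts) * rate


def apply_twice(func, value):
    """Higher-order function: takes another function as a parameter."""
    return func(func(value))


def factorial(n):
    """Recursion: the function calls itself on a smaller input."""
    return 1 if n <= 1 else n * factorial(n - 1)


# A lambda as the key function of sorted(): order records by quantity.
records = [("widget", 3), ("gadget", 1), ("doohickey", 2)]
by_quantity = sorted(records, key=lambda record: record[1])

# Module entry point: this runs only when the file is executed directly,
# not when it is imported by another module.
if __name__ == "__main__":
    print(total(1, 2, 3, rate=0.5))
    print(apply_twice(lambda x: x + 3, 1))
    print(factorial(5))
    print(by_quantity)
```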

File I/O and Useful Modules

  • Reading Command-Line Parameters
  • Hands-On Exercise (N/A in DCC)
  • Working with Files
  • Reading and Writing Files
  • Random Numbers
  • Regular Expressions
  • The re Object Methods
  • Using Regular Expressions Examples
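A minimal sketch of the re module methods named above (compile, search, findall, sub), applied to a made-up log line. The log format is an assumption for illustration only.

```python
import re

log_line = "2024-08-19 10:00:03 ERROR job=etl_daily rows=1500"

# compile() builds a reusable pattern; search() finds the first match anywhere in the string.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
match = date_pattern.search(log_line)
year, month, day = match.groups()

# findall() with two groups returns a list of (key, value) tuples.
pairs = dict(re.findall(r"(\w+)=(\w+)", log_line))

# sub() replaces every run of digits with a placeholder.
masked = re.sub(r"\d+", "#", "rows=1500 errors=3")

print(year, month, day, pairs, masked)
```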

Practical Introduction to NumPy

  • NumPy
  • The First Take on NumPy Arrays
  • The ndarray Data Structure
  • Getting Help
  • Understanding Axes
  • Indexing Elements in a NumPy Array
  • Understanding Types
  • Re-Shaping
  • Commonly Used Array Metrics
  • Commonly Used Aggregate Functions
  • Sorting Arrays
  • Vectorization
  • Vectorization Visually
  • Broadcasting
  • Broadcasting Visually
  • Filtering
  • Array Arithmetic Operations
  • Reductions: Finding the Sum of Elements by Axis
  • Array Slicing
  • 2-D Array Slicing
  • Slicing and Stepping Through
  • The Linear Algebra Functions

Practical Introduction to pandas

  • What is pandas?
  • The Series Object
  • Accessing Values and Indexes in Series
  • Setting Up Your Own Index
  • Using the Series Index as a Lookup Key
  • Can I Pack a Python Dictionary into a Series?
  • The DataFrame Object
  • The DataFrame's Value Proposition
  • Creating a pandas DataFrame
  • Getting DataFrame Metrics
  • Accessing DataFrame Columns
  • Accessing DataFrame Rows
  • Accessing DataFrame Cells
  • Using iloc
  • Using loc
  • Examples of Using loc
  • DataFrames are Mutable via Object Reference!
  • The Axes
  • Deleting Rows and Columns
  • Adding a New Column to a DataFrame
  • Appending / Concatenating DataFrame and Series Objects
  • Example of Appending / Concatenating DataFrames
  • Re-indexing Series and DataFrames
  • Getting Descriptive Statistics of DataFrame Columns
  • Navigating Rows and Columns For Data Reduction
  • Getting Descriptive Statistics of DataFrames
  • Applying a Function
  • Sorting DataFrames
  • Reading From CSV Files
  • Writing to the System Clipboard
  • Writing to a CSV File
  • Fine-Tuning the Column Data Types
  • Changing the Type of a Column
  • What May Go Wrong with Type Conversion

Data Grouping and Aggregation with pandas

  • Data Aggregation and Grouping
  • Sample Data Set
  • The pandas.core.groupby.SeriesGroupBy Object
  • Grouping by Two or More Columns
  • Emulating SQL's WHERE Clause
  • The Pivot Tables
  • Cross-Tabulation

Repairing and Normalizing Data

  • Repairing and Normalizing Data
  • Dealing with the Missing Data
  • Sample Data Set
  • Getting Info on Null Data
  • Dropping a Column
  • Interpolating Missing Data in pandas
  • Replacing the Missing Values with the Mean Value
  • Scaling (Normalizing) the Data
  • Data Preprocessing with scikit-learn
  • Scaling with the scale() Function
  • The MinMaxScaler Object
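The arithmetic behind two of the topics above, replacing missing values with the mean and min-max scaling, can be sketched with the standard library alone. This is not the pandas or scikit-learn API that the course covers, only an assumption-laden illustration of what those tools compute; the readings list is made up.

```python
from statistics import mean

# None marks a missing reading in this made-up sample.
readings = [10.0, None, 30.0, 20.0, None]

# Repair: replace each missing value with the mean of the known values.
known = [value for value in readings if value is not None]
fill_value = mean(known)
repaired = [fill_value if value is None else value for value in readings]

# Normalize: min-max scaling maps the smallest value to 0 and the largest to 1.
low, high = min(repaired), max(repaired)
scaled = [(value - low) / (high - low) for value in repaired]

print(repaired)
print(scaled)
```

Note that this sketch assumes the values are not all equal; otherwise high - low would be zero and the division would fail.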

Data Visualization in Python

  • Data Visualization
  • Data Visualization in Python
  • Matplotlib
  • Getting Started with matplotlib
  • The matplotlib.pyplot.plot() Function
  • The matplotlib.pyplot.bar() Function
  • The matplotlib.pyplot.pie() Function
  • The matplotlib.pyplot.subplot() Function
  • A Subplot Example
  • Figures
  • Saving Figures to a File
  • Seaborn
  • Getting Started with seaborn
  • Histograms and KDE
  • Plotting Bivariate Distributions
  • Scatter plots in seaborn
  • Pair plots in seaborn
  • Heatmaps
  • A Seaborn Scatterplot with Varying Point Sizes and Hues
  • ggplot

Python as a Cloud Scripting Language

  • Python's Value
  • Python on AWS
  • AWS SDK For Python (boto3)
  • What is Serverless Computing?
  • How Functions Work
  • The AWS Lambda Event Handler
  • What is AWS Glue?
  • PySpark on Glue - Sample Script

Introduction to Apache Spark

  • What is Apache Spark
  • The Spark Platform
  • Spark vs Hadoop's MapReduce (MR)
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Spark Application Architecture
  • The Driver Process
  • The Executor and Worker Processes
  • Spark Shell
  • Jupyter Notebook Shell Environment
  • Spark Applications
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • Interfaces with Data Storage Systems
  • The Resilient Distributed Dataset (RDD)
  • Datasets and DataFrames
  • Spark SQL, DataFrames, and Catalyst Optimizer
  • Project Tungsten
  • Spark Machine Learning Library
  • Spark (Structured) Streaming
  • GraphX
  • Extending Spark Environment with Custom Modules and Files
  • Spark 3
  • Spark 3 Updates at a Glance

The Spark Shell

  • The Spark Shell
  • The Spark v2+ Command-Line Shells
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • Jupyter Notebook Shell Environment
  • Example of a Jupyter Notebook Web UI (Databricks Cloud)
  • The Spark Context (sc) and Spark Session (spark)
  • Creating a Spark Session Object in Spark Applications
  • The Shell Spark Context Object (sc)
  • The Shell Spark Session Object (spark)
  • Loading Files
  • Saving Files

Spark RDDs

  • The Resilient Distributed Dataset (RDD)
  • Ways to Create an RDD
  • Supported Data Types
  • RDD Operations
  • RDDs are Immutable
  • Spark Actions
  • RDD Transformations
  • Other RDD Operations
  • Chaining RDD Operations
  • RDD Lineage
  • The Big Picture
  • What May Go Wrong
  • Miscellaneous Pair RDD Operations
  • RDD Caching

Parallel Data Processing with Spark

  • Running Spark on a Cluster
  • Data Partitioning
  • Data Partitioning Diagram
  • Single Local File System RDD Partitioning
  • Multiple File RDD Partitioning
  • Special Cases for Small-sized Files
  • Parallel Data Processing of Partitions
  • Spark Application, Jobs, and Tasks
  • Stages and Shuffles
  • The "Big Picture"

Introduction to Spark SQL

  • What is Spark SQL?
  • Uniform Data Access with Spark SQL
  • Using JDBC Sources
  • Hive Integration
  • What is a DataFrame?
  • Creating a DataFrame in PySpark
  • Commonly Used DataFrame Methods and Properties in PySpark
  • Grouping and Aggregation in PySpark
  • The "DataFrame to RDD" Bridge in PySpark
  • The SQLContext Object
  • Examples of Spark SQL / DataFrame (PySpark Example)
  • Converting an RDD to a DataFrame Example
  • Example of Reading / Writing a JSON File
  • Performance, Scalability, and Fault-tolerance of Spark SQL

Lab Exercises

  • Lab 1. Data Availability and Consistency
  • Lab 2. A/B Testing Data Engineering Tasks Project
  • Lab 3. Learning the Databricks Community Cloud Lab Environment
  • Lab 4. Python Variables
  • Lab 5. Dates and Times
  • Lab 6. The if, for, and try Constructs
  • Lab 7. Understanding Lists
  • Lab 8. Dictionaries
  • Lab 9. Sets
  • Lab 10. Tuples
  • Lab 11. Functions
  • Lab 12. Functional Programming
  • Lab 13. File I/O
  • Lab 14. Using HTTP and JSON
  • Lab 15. Random Numbers
  • Lab 16. Regular Expressions
  • Lab 17. Understanding NumPy
  • Lab 18. A NumPy Project
  • Lab 19. Understanding pandas
  • Lab 20. Data Grouping and Aggregation
  • Lab 21. Repairing and Normalizing Data
  • Lab 22. Data Visualization and EDA with pandas and seaborn
  • Lab 23. Correlating Cause and Effect
  • Lab 24. Learning PySpark Shell Environment
  • Lab 25. Understanding Spark DataFrames
  • Lab 26. Learning the PySpark DataFrame API
  • Lab 27. Data Repair and Normalization in PySpark
  • Lab 28. Working with Parquet File Format in PySpark and pandas

Prerequisites

  • Some working experience in any programming language (the students will be introduced to programming in Python).
  • Basic understanding of SQL and data processing concepts, including data grouping and aggregation.

Other Available Dates for this Course

Virtual Classroom Live
January 20, 2025
$3,100.00
5 days    10 AM ET - 6 PM ET

Virtual Classroom Live
March 03, 2025
$3,100.00
5 days    10 AM ET - 6 PM ET

Virtual Classroom Live
April 14, 2025
$3,100.00
5 days    10 AM ET - 6 PM ET