ProspenAfrica | Training and Consulting Services Provider

Big Data Analytics with Spark Training


Dates: Available on Request
Locations: Johannesburg, South Africa
Platform: Available In-Class

Price: Available on request

Course Introduction

Analysing very large datasets calls for an equally large set of computers. Using so many machines successfully entails distributed file systems, such as the Hadoop Distributed File System (HDFS), and parallel computation models, such as Hadoop MapReduce and Spark.


In this Big Data Analytics with Spark Training Course, you will learn the essential components of large-scale parallel computing projects and how to use Spark to minimize bottlenecks. The course teaches you how to conduct supervised and unsupervised machine learning on substantial datasets using Spark's Machine Learning Library (MLlib), with hands-on experience in PySpark.
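To make the parallel model concrete, here is a plain-Python, single-machine sketch of the map and reduce phases (the example sentences are invented for illustration); in a real cluster each map call would run on a different worker over an HDFS block:

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """Map: each text chunk independently emits per-word counts."""
    return Counter(chunk.split())

def reduce_phase(a, b):
    """Reduce: merge partial counts from different workers."""
    return a + b  # Counter addition sums counts per key

# Two "partitions" that could live on different machines.
chunks = ["spark makes big data simple", "big data needs big clusters"]
partials = [map_phase(c) for c in chunks]  # parallel in a real cluster
totals = reduce(reduce_phase, partials)
print(totals["big"])  # 3
```

Spark applies exactly this split: cheap, independent map work distributed across the cluster, followed by a shuffle-and-merge reduce step.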


Skills Covered

This program will provide you with knowledge and expertise in:

  • Scala programming

  • Spark installation

  • Resilient Distributed Datasets (RDD)

  • SparkSQL

  • Spark Streaming

  • Spark ML Programming

  • GraphX programming

Course Objectives

Upon successfully completing this Big Data Analytics with Spark Training course, participants will be able to:

  • Obtain an overview of Big Data & Hadoop, including HDFS and YARN (Yet Another Resource Negotiator)

  • Gain comprehensive knowledge of various tools in the Spark ecosystem

  • Understand how to ingest data into HDFS using Sqoop & Flume

  • Program Spark using PySpark

  • Identify the computational trade-offs in a Spark application

  • Model data using statistical and machine learning methods

  • Use real-time data feeds through a publish-subscribe messaging system like Kafka

  • Gain exposure to various real-life industry-based projects

  • Study projects in diverse domains, such as banking, telecommunications, social media, and government

Organisational Benefits

Companies that send employees to participate in this course can benefit by:

  • Adopting technology used successfully by multiple companies in various domains globally

  • Attracting more investors, as 56% of enterprises will increase their investment in big data over the next three years (according to Forbes)

  • Providing the workforce with flexible and cost-effective professional development opportunities

  • Analysing case studies in this domain and applying successful techniques in their organisation

  • Comprehending the principles and practice of Big Data Analytics and its operational context

Who should attend?

This course is suitable for:

  • Developers and Architects

  • BI / ETL / DW Professionals

  • Senior IT Professionals

  • Testing Professionals

  • Mainframe Professionals

  • Freshers

  • Big Data Enthusiasts

  • Software Architects, Engineers, and Developers

  • Data Scientists and Analytics Professionals


Training Methodology

Our diverse instructional approaches ensure effective learning:

– Lectures & Presentations: Engage with expert-driven, stimulating content.
– Course Material: Access well-crafted supporting resources.
– Group Work: Collaborate on discussions and case studies for practical insights.
– Workshops & Role-Play: Participate in immersive, scenario-based activities.
– Practical Application: Focus on applying theoretical knowledge in real situations.
– Post-Training Support: Receive extensive support after training for skill implementation.

Training Outline

Module 1: Introduction to Big Data Hadoop and Spark

  • What is Big Data?

  • Big Data Customer Scenarios

  • Big Data and Hadoop

  • How Hadoop Solves the Big Data Problem

  • What is Hadoop?

  • Hadoop’s Key Characteristics

  • Hadoop Ecosystem and HDFS

  • Hadoop Core Components

  • Rack Awareness and Block Replication

  • YARN and its Advantage

  • Hadoop Cluster and its Architecture

  • Hadoop: Different Cluster Modes

  • Why is Spark Needed?

  • What is Spark?

  • How Does Spark Differ from Other Frameworks?

  • Spark at Yahoo!

Module 2: Introduction to Scala for Apache Spark

  • What is Scala?

  • Why Scala for Spark?

  • Scala in other Frameworks

  • Control Structures in Scala

  • Foreach loop, Functions, and Procedures

  • Collections in Scala: Array

  • Introduction to Scala REPL

  • Basic Scala Operations

  • Variable Types in Scala

  • ArrayBuffer, Map, Tuples, Lists, and more

  • Scala REPL Detailed Demo

Module 3: Functional Programming and OOP Concepts in Scala

  • Auxiliary Constructor and Primary Constructor

  • Singletons

  • Extending a Class

  • Overriding Methods

  • Traits as Interfaces and Layered Traits

  • OOP Concepts

  • Functional Programming

  • Higher-Order Functions

  • Anonymous Functions

  • Class in Scala

  • Getters and Setters

  • Custom Getters and Setters

  • Properties with only Getters

Module 4: Deep Dive into Apache Spark Framework

  • Submitting Spark Job

  • Spark Web UI

  • Data Ingestion using Sqoop

  • Building and Running Spark Application

  • Spark Application Web UI

  • Spark’s Place in the Hadoop Ecosystem

  • Spark Components & its Architecture

  • Spark Deployment Modes

  • Introduction to Spark Shell

  • Writing your first Spark Job Using SBT

  • Configuring Spark Properties

Module 5: Playing with Spark RDDs

  • RDD Persistence

  • WordCount Program Using RDD Concepts

  • Passing Functions to Spark

  • Loading data in RDDs

  • Saving data through RDDs

  • RDD Transformations

  • Challenges in Existing Computing Methods

  • Probable Solution & How RDD Solves the Problem

  • What is RDD, Its Operations, Transformations & Actions

  • Key-Value Pair RDDs

  • Other Pair RDDs, Two Pair RDDs

  • RDD Lineage

  • RDD Actions and Functions

  • RDD Partitions
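The WordCount example in this module chains three RDD operations: flatMap to split lines into words, map to pair each word with a count of 1, and reduceByKey to sum the counts. Below is a plain-Python stand-in for that chain, so the logic can be followed without a Spark installation (the sample lines are invented):

```python
from collections import defaultdict

lines = ["to be or not to be", "to see or not to see"]

# flatMap: split every line into one flat list of words
words = [w for line in lines for w in line.split()]

# map: pair each word with 1, like rdd.map(lambda w: (w, 1))
pairs = [(w, 1) for w in words]

# reduceByKey: sum values per key, like rdd.reduceByKey(lambda a, b: a + b)
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value

print(counts["to"])  # 4
```

In Spark, flatMap and map are lazy transformations that only build the RDD lineage; nothing executes until an action such as collect or count triggers the job.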

Module 6: DataFrames and Spark SQL

  • Need for Spark SQL

  • What is Spark SQL?

  • Spark SQL Architecture

  • Spark – Hive Integration

  • Spark SQL – Creating Data Frames

  • Loading and Transforming Data through Different Sources

  • Stock Market Analysis

  • SQL Context in Spark SQL

  • User-Defined Functions

  • Data Frames & Datasets

  • Interoperating with RDDs

  • JSON and Parquet File Formats
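Spark SQL lets you run ordinary SQL over DataFrames registered as views. To show the kind of query involved without a Spark cluster, the sketch below runs the same SQL on the standard library's sqlite3; the trades table and its values are invented for illustration:

```python
import sqlite3

# sqlite3 stands in for Spark SQL's engine; in PySpark you would write
# df.createOrReplaceTempView("trades") and then spark.sql(...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [("AAA", 10.0), ("AAA", 12.0), ("BBB", 20.0)])

rows = conn.execute(
    "SELECT symbol, AVG(price) FROM trades GROUP BY symbol ORDER BY symbol"
).fetchall()
print(rows)  # [('AAA', 11.0), ('BBB', 20.0)]
```

The same GROUP BY query, submitted through spark.sql, would be planned by Catalyst and executed in parallel across the cluster's partitions.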

Module 7: Machine Learning Using Spark MLlib

  • Why Machine Learning?

  • What is Machine Learning?

  • Where is Machine Learning Used?

  • Use Case: Face Detection

  • Different Types of Machine Learning Techniques

  • Introduction to MLlib

  • Features of MLlib and MLlib Tools

  • Various ML algorithms supported by MLlib

Module 8: Deep Dive into Spark MLlib

  • K-Means Clustering

  • Linear Regression

  • Logistic Regression

  • Decision Tree

  • Random Forest
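Of the algorithms above, K-Means is the easiest to sketch end to end. The toy one-dimensional version below follows the same assign-then-recompute loop that MLlib parallelises across a cluster; the points and starting centroids are invented:

```python
def kmeans_1d(points, centroids, iters=10):
    """1-D K-Means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    for _ in range(iters):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centroids = [sum(v) / len(v) for v in clusters.values() if v]
    return sorted(centroids)

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
print(kmeans_1d(points, centroids=[1.0, 10.0]))  # [1.5, 10.5]
```

MLlib's distributed K-Means performs the assignment step on each partition in parallel, then aggregates the cluster sums to recompute centroids on the driver.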

Module 9: Understanding Apache Kafka and Apache Flume

  • What is Apache Flume?

  • Need for Apache Flume

  • Basic Flume Architecture

  • Flume Sources

  • Flume Sinks

  • Flume Channels

  • Flume Configuration

  • Need for Kafka

  • What is Kafka?

  • Core Concepts of Kafka

  • Kafka Architecture

  • Where is Kafka Used?

  • Understanding the Components of Kafka Cluster

  • Configuring Kafka Cluster

  • Kafka Producer and Consumer Java API

  • Integrating Apache Flume and Apache Kafka

  • Configuring Single Node Single Broker Cluster

  • Configuring Single Node Multi Broker Cluster

  • Producing and Consuming Messages

  • Flume Commands

  • Setting up Flume Agent

  • Streaming Twitter Data into HDFS
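Kafka's core abstraction is simpler than its deployment: a broker keeps an append-only log per topic, and each consumer group tracks how far it has read. The toy class below models just that idea (it is not the Kafka API, and it ignores partitions and replication):

```python
from collections import defaultdict

class MiniBroker:
    """Toy model of Kafka: an append-only log per topic, with each
    consumer group tracking its own read offset."""
    def __init__(self):
        self.logs = defaultdict(list)     # topic -> list of messages
        self.offsets = defaultdict(int)   # (group, topic) -> next offset

    def produce(self, topic, message):
        self.logs[topic].append(message)

    def consume(self, group, topic):
        start = self.offsets[(group, topic)]
        batch = self.logs[topic][start:]
        self.offsets[(group, topic)] = len(self.logs[topic])
        return batch

broker = MiniBroker()
broker.produce("tweets", "hello")
broker.produce("tweets", "world")
print(broker.consume("analytics", "tweets"))  # ['hello', 'world']
print(broker.consume("analytics", "tweets"))  # []
```

Because offsets belong to the consumer group rather than to the log, a second group would re-read the same messages from the start — the property that makes Kafka a replayable publish-subscribe system.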

Module 10: Streaming – Multiple Batches

  • Why is Streaming Necessary?

  • Drawbacks in Existing Computing Methods

  • What is Spark Streaming?

  • Spark Streaming Features

  • Spark Streaming Workflow

  • How Uber Uses Streaming Data

  • Streaming Context & DStreams

  • Transformations on DStreams

  • Important Windowed Operators

  • Slice, Window, and ReduceByWindow Operators

  • Stateful Operators
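The windowed operators above aggregate over the last N micro-batches rather than a single one. Below is a plain-Python analogue of a sliding-window sum (what reduceByWindow computes with addition) over invented per-batch counts:

```python
from collections import deque

def windowed_sums(stream, window=3):
    """Sliding-window sum over a stream of per-batch counts,
    emitting one aggregate per incoming batch."""
    buf = deque(maxlen=window)  # keeps only the last `window` batches
    out = []
    for batch in stream:
        buf.append(batch)
        out.append(sum(buf))
    return out

print(windowed_sums([1, 2, 3, 4, 5]))  # [1, 3, 6, 9, 12]
```

Spark Streaming additionally takes a slide interval, so the aggregate can be emitted every k batches instead of after every batch.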

Module 11: Apache Spark Streaming – Data Sources

  • Apache Spark Streaming: Data Sources

  • Apache Flume and Apache Kafka Data Sources

  • Example: Using a Kafka Direct Data Source

  • Perform Twitter Sentiment Analysis Using Spark Streaming

  • Streaming Data Source Overview

  • Different Streaming Data Sources

Module 12: Spark GraphX

  • Key Concepts of Spark GraphX

  • GraphX Algorithms and their Implementations
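PageRank is one of the algorithms GraphX ships with; the pure-Python version below shows the iteration itself, with each vertex sharing its rank equally among its out-links (the three-node graph is invented for illustration):

```python
def pagerank(graph, damping=0.85, iters=50):
    """Iterative PageRank over an adjacency dict {vertex: out-links}."""
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in graph}
        for v, outs in graph.items():
            for u in outs:
                new[u] += damping * ranks[v] / len(outs)
        ranks = new
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # c
```

GraphX expresses the same computation with its Pregel-style message passing, which is what lets the per-edge updates scale across a cluster.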


Related Courses

Advanced Electronic Document and Records Management
Dates: 22 - 26 Jul | 09 - 13 Sep | 11 - 15 Nov 2024

Azure Big Data Analytics Training
Dates: Available on Request

Basic Registry, Records, And Archives Management Course
Dates: 15 - 19 Jul | 16 - 20 Sep | 02 - 06 Dec 2024

Big Data Analytics with Python Training
Dates: Available on Request
