Apache Spark
Apache Spark is a fast, general-purpose engine for large-scale data processing: an open source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads.
How Apache Spark works
Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores such as Apache Hive.
Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also perform conventional disk-based processing when data sets are too large to fit into the available system memory.
The Spark Core engine uses the resilient distributed dataset, or RDD, as its basic data type. The RDD is designed to hide much of the computational complexity from users. It aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user doesn't have to define where specific files are sent or what computational resources are used to store or retrieve files.
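As a rough illustration of the RDD model, here is a minimal sketch using Spark's Java API (class and variable names are illustrative): a local collection is turned into an RDD, and Spark handles the partitioning and placement.
[code]
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
    public static void main(String[] args) {
        // Run locally with two threads; on a cluster the same code runs unchanged.
        SparkConf conf = new SparkConf().setAppName("RddSketch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // parallelize() turns a local collection into an RDD partitioned across the cluster.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

        // The user never specifies which node holds which partition.
        long evens = numbers.filter(n -> n % 2 == 0).count();
        System.out.println("Even values: " + evens);

        sc.stop();
    }
}
[/code]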
In addition, Spark can handle more than the batch processing applications that MapReduce is limited to running.
Spark libraries
The Spark Core engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data. Aside from the Spark Core processing engine, the Apache Spark API environment comes packaged with some libraries of code for use in data analytics applications. These libraries include:
- Spark SQL -- One of the most commonly used libraries, Spark SQL enables users to query data stored in disparate applications using the common SQL language (see the sketch after this list).
- Spark Streaming -- This library enables users to build applications that analyze and present data in real time.
- MLlib -- A library of machine learning code that enables users to apply advanced statistical operations to data in their Spark cluster and to build applications around these analyses.
- GraphX -- A built-in library of algorithms for graph-parallel computation.
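As an example of the Spark SQL library, here is a minimal sketch using the Spark 1.x Java API; the file name people.json and the query are illustrative assumptions, and the spark-sql artifact must also be on the classpath.
[code]
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SparkSqlSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkSqlSketch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load a JSON file (illustrative path) into a DataFrame and register it as a table.
        DataFrame people = sqlContext.read().json("people.json");
        people.registerTempTable("people");

        // Query the data with ordinary SQL.
        DataFrame adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18");
        adults.show();

        sc.stop();
    }
}
[/code]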
Spark languages
Spark was written in Scala, which is considered the primary language for interacting with the Spark Core engine. Out of the box, Spark also comes with API connectors for using Java and Python. Java is not considered an optimal language for data engineering or data science, so many users rely on Python, which is simpler and more geared toward data analysis.
There is also an R programming package that users can download and run in Spark. This enables users to run the popular desktop data science language on larger distributed data sets in Spark and to use it to build applications that leverage machine learning algorithms.
Apache Spark use cases
The wide range of Spark libraries and its ability to compute data from many different types of data stores means Spark can be applied to many different problems in many industries. Digital advertising companies use it to maintain databases of web activity and design campaigns tailored to specific consumers. Financial companies use it to ingest financial data and run models to guide investing activity. Consumer goods companies use it to aggregate customer data and forecast trends to guide inventory decisions and spot new market opportunities.
Prerequisites:
Java should be pre-installed on the machines on which the Spark jobs will run.
Set the JAVA_HOME environment variable:
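For example, assuming a JDK installed at the paths shown (adjust these to the actual installation directory):
[code]
:: Windows (Command Prompt)
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0"

# Linux
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$JAVA_HOME/bin
[/code]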
Download Apache Spark from: http://spark.apache.org/downloads.html
To install Apache Spark on Windows, it needs to be installed in standalone mode.
Windows:
Install Scala
Download Scala from the link: http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.msi
Check the installation by running scala -version at the command prompt.
Install Spark 1.6.1
Download it from the following link: http://spark.apache.org/downloads.html and extract it into the C drive, such as c:\Spark.
cd into the folder c:\spark\bin and execute spark-shell. If successful, this starts the interactive Spark shell with a SparkContext already available as sc.
Download Maven: apache.mivzakim.net/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.zip
Run mvn -version on cmd to ensure correct installation.
A successful Maven installation will display the Maven version along with the Java version and the Maven home directory.
Linux Installation
1). Install Java and verify the installation using $ java -version. If it is not installed, install it using:
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
2). Ensure the Scala language is installed (it is used to implement Spark). Use $ scala -version to check whether it is installed; if not:
apt-get install scala (use this to receive updates if they exist)
3). Install Spark. Download it from http://spark.apache.org/downloads.html
apt-get install spark
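If the spark package is not available in the distribution's repositories, the downloaded archive can also be extracted manually; a minimal sketch, assuming the spark-1.6.1-bin-hadoop2.6 build was downloaded:
[code]
$ tar -xzf spark-1.6.1-bin-hadoop2.6.tgz
$ sudo mv spark-1.6.1-bin-hadoop2.6 /usr/local/spark
$ export SPARK_HOME=/usr/local/spark
$ export PATH=$PATH:$SPARK_HOME/bin
[/code]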
Creating a Sample Project
Open Eclipse and ensure Maven has been installed.
Create a new project > Maven.
The folder structure should be the standard Maven layout, with the application sources under src/main/java and pom.xml at the project root.
In pom.xml, set the Maven dependencies:
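A minimal sketch of the dependencies section, assuming Spark 1.6.1 with Scala 2.10 artifacts (the spark-streaming dependency is only needed for the streaming example later in this post):
[code]
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.6.1</version>
  </dependency>
</dependencies>
[/code]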
In the App.java file:
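A minimal sketch of the application class; the package name com.example and the behavior (printing the lines of a text file passed as an argument) are assumptions matching the description below.
[code]
package com.example;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class App {
    public static void main(String[] args) {
        // Path to a local text file, passed when the job is submitted.
        String inputFile = args[0];

        SparkConf conf = new SparkConf().setAppName("Sample Spark Job");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(inputFile);

        // Print every line in the submitted file, then the line count.
        for (String line : lines.collect()) {
            System.out.println(line);
        }
        System.out.println("Total lines: " + lines.count());

        sc.stop();
    }
}
[/code]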
Build the project using mvn package.
When the job is run, it will show the lines in the file submitted.
To submit the job to Spark and produce the output, use the spark-submit syntax:
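For example, assuming the App class and jar name from the sketch above (both illustrative) and a local input file input.txt:
[code]
spark-submit --class com.example.App --master local[2] target/sample-project-1.0.jar input.txt
[/code]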
To access a job that has been submitted, use the Spark web UI at http://localhost:4040/jobs/.
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
Requirements
Download Netcat for Windows (a small utility found in most Unix-like systems).
The command to have netcat listen on a specific port is “nc -l PORT_NUMBER”. If you run this on a Windows 7 machine, you will get the message “local listen fuxored: INVAL”. The fix is to run it with the -L option, so the command would look like this:
[code]nc -L -p 80[/code]
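Following the same syntax, the streaming example in this post connects to port 4500, so netcat is started with:
[code]nc -L -p 4500[/code]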
Netcat is now listening on port 4500.
Check the streaming process at: http://localhost:4040/jobs/
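The streaming application itself can be a minimal sketch like the following; the class name and port are illustrative, and it simply reads lines from the netcat listener above and prints each micro-batch.
[code]
package com.example;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingApp {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("StreamingSketch");
        // Process the stream in 5-second micro-batches.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Read lines sent to the netcat listener on port 4500.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 4500);

        // Print the first elements of each batch to the console.
        lines.print();

        jssc.start();
        jssc.awaitTermination();
    }
}
[/code]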
Run mvn package, then submit the job to Spark using spark-submit:
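For example, assuming the StreamingApp class and jar name from the sketches above (both illustrative):
[code]
spark-submit --class com.example.StreamingApp --master local[2] target/sample-project-1.0.jar
[/code]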