Apache Spark is an engine for large-scale data processing. It is optimized for executing multiple parallel operations on the same data set, a pattern that occurs in many iterative machine learning tasks.

In Spark, data is stored as resilient distributed datasets (RDDs), i.e., immutable, persistent collections of objects that are partitioned across several computers. Two types of operations are defined on RDDs:

  1. Transformations: RDD x Function -> RDD
    apply a function on an RDD and create a new RDD
  2. Actions: RDD x Function -> Object
    compute a result from an RDD

These operations are executed lazily: Spark records the sequence of operations performed on an RDD instead of running them immediately. The recorded operations are only executed if (i) an action is performed and a result has to be computed or (ii) the computation of an RDD is explicitly requested.
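The difference between recording an operation and executing it can be illustrated with a small toy model in plain Python. This is only a sketch of the idea and not Spark's implementation or API: the class name LazyDataset and the helper square are hypothetical, and the data lives in a single process instead of being partitioned across machines. Transformations return a new object that remembers what to do, and only an action runs the recorded steps.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # immutable copy of the input data
        self._steps = tuple(steps)    # recorded transformations, not yet run

    # Transformations: return a new LazyDataset and do no work.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _compute(self):
        # Replay the recorded steps over the source data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the recorded steps and return a plain Python object.
    def collect(self):
        return list(self._compute())

    def reduce(self, fn):
        return reduce(fn, self._compute())


calls = []  # records every invocation of square, to make evaluation visible


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
evens_squared = numbers.map(square).filter(lambda y: y % 2 == 0)

calls_before_action = len(calls)  # nothing has been computed so far
result = evens_squared.collect()  # the action triggers the computation
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)
```

In this toy model, every action replays the recorded steps from the source data, and the original object numbers is never modified by the transformations applied to it.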

The execution of a Spark application starts with the driver program. The driver acquires resources from an external cluster management system such as YARN and instructs that system to start the processes of the application.

Additional APIs


Newer versions of Spark include high-level APIs for working with data, including Spark SQL, DataFrames and Datasets (the Dataset API is not currently available in Python). These APIs give Spark additional information about the structure of the data and of the computation, beyond what an RDD provides, which enables "extra optimizations".[1]


  1. http://spark.apache.org/docs/latest/sql-programming-guide.html