Key Apache Spark Trends from Spark Summit East 2017

I was lucky enough to attend Spark Summit East 2017 on February 8-9. I had to brave the 12 inches of snow that blizzard Nico dumped on Boston, but I learned a lot about the strategic direction of the Apache Spark open source project and its ecosystem. In this post I’ll fill you in on the key Spark trends from the conference. It will be useful if your organization already uses Spark or is looking to adopt it, or if you are simply interested in Spark yourself.

Boston: Spark Summit East 2017 host city. Image source: https://pbs.twimg.com/media/C5DOQJpW8AIwACy.jpg

Disclaimer: The opinions expressed in this article are my own and do not reflect the view of any other individuals or organizations.

Spark moves to Primetime: Key Spark Trends from Spark Summit East 2017

The overarching theme I took away from the conference was: Spark moves to Primetime. In other words, Spark is maturing from a tool used by “early adopter” analytics shops (small startups, or corporations with mature analytics departments, that use it mostly for development or offline analytics) into one used by the “early majority” (large data-driven organizations that require that tools be enterprise/production-ready). For those of you familiar with Geoffrey Moore’s book Crossing the Chasm, this is the “Chasm” for Spark adoption.

Spark adoption reaches THE CHASM. Image source: http://i1.wp.com/www.theinnovativemanager.com/wp-content/uploads/2014/10/Product-life-cycle-crossing-the-chasm.jpg

The Spark moves to Primetime theme was clear in the conference’s significant focus on Enterprise/Production-ready Spark. There was much discussion of making Spark more real-time, more secure, and easier to operate (e.g., easier for IT to deploy and manage, and easier for data scientists to access). The whole second day of the conference was even themed Enterprise Day!

The Spark community made many of these enterprise/production-ready features available in the recently released Spark 2.0, and the newly formed Berkeley RISELab, the successor to the Berkeley AMPLab, is looking to take them to the next level.

RISELab: Enabling Intelligent Real-Time Decisions keynote by Ion Stoica

Easier to Operationalize Spark

The learning curve and long-term sustainability of a tool are frequently big hurdles to adoption. There was a strong focus on making Spark easier to operate (e.g., to deploy, access, and manage) in an effort to flatten that learning curve and make Spark a more sustainable solution.

Matei Zaharia, creator of the Spark project, discussed in his keynote how Spark helps “democratize access to data.” In other words, data engineers and data scientists can choose from several programming languages when writing Spark code, including the more recent addition of R. Additionally, the high-level Spark APIs (Spark SQL, DataFrames, ML Pipelines, PySpark, SparkR) make Spark development easier.

Developers have several languages to pick from to use Spark, including the most recent addition of R
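
To make that concrete, here is a minimal PySpark sketch, assuming a Spark 2.x session and a hypothetical events.json file with a country column, that expresses the same aggregation through two of those high-level APIs (DataFrames and Spark SQL):

```python
from pyspark.sql import SparkSession

# SparkSession is the unified entry point introduced in Spark 2.0
spark = SparkSession.builder.appName("high-level-api-demo").getOrCreate()

# Load a (hypothetical) JSON dataset into a DataFrame
events = spark.read.json("events.json")

# The same aggregation expressed two ways:
# 1. The DataFrame API
events.groupBy("country").count().show()

# 2. Spark SQL over the same data
events.createOrReplaceTempView("events")
spark.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country").show()
```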

In addition to easier access to data, many vendors in the Spark ecosystem pitched ways to simplify Spark deployment and management. The main sponsor, Databricks, heavily pushed the capabilities of its Spark cloud platform. The conference goodie bag had several flyers marketing “Data Science Sandbox” and “Data Science as a Service” solutions. BlueData presented on running Spark in Docker containers. OpenShift discussed their platform, built on Docker and Kubernetes, that enables a “fully elastic Spark application with little more than the click of a button.” All this focus on making Spark easier to operationalize will help drive overall Spark growth.

Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by Tom Phelan

“Continuous Applications”

While there are countless tools for streaming, batch, and data-serving analytics workloads, it is difficult to build applications that integrate all three. With Spark 2.0 (specifically Spark Structured Streaming), Spark aims to provide a single API for what Databricks has coined “continuous apps”: apps that merge streaming and batch workloads.
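
To give a feel for what that single API looks like, here is a minimal Structured Streaming sketch, assuming a Spark 2.x environment; the schema, input path, and window size are hypothetical. The key point is that the windowed aggregation below would run unchanged on a static (batch) DataFrame:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("continuous-app-sketch").getOrCreate()

# Hypothetical event schema: a timestamp plus a country code
schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("country", StringType()),
])

# Treat a directory of JSON files as an unbounded input stream
# (the "incoming/" path is a placeholder)
events = spark.readStream.schema(schema).json("incoming/")

# The same windowed aggregation would run unchanged on a static
# (batch) DataFrame; that is the point of the unified API
counts = events.groupBy(window("event_time", "10 minutes"), "country").count()

# Continuously maintain the aggregate; a real continuous app might
# serve this table out of a database instead of the console
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```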

Matei Zaharia discussed how continuous applications integrate streaming, batch, and data serving workloads

Similarly, Steven Camiña, Principal Product Manager at MemSQL, discussed the massive opportunity in converting existing batch processes to real-time.

Building the Ideal Stack for Real-Time Analytics, presented by Steven Camiña (MemSQL)

Going forward, the Berkeley RISELab will have a heavy focus on enabling “continuous apps” given the group’s “real-time” focus:

RISELab: Enabling Intelligent Real-Time Decisions keynote by Ion Stoica

Faster Big Data Analytics

A major performance issue in big data analytics right now is the compute bottleneck: while storage and network performance have improved roughly 10x since 2010, CPU performance has remained relatively stagnant.

In his keynote, Matei Zaharia discussed the existing compute bottleneck created by the limited increase in CPU performance since 2010

To address this issue, the Spark community created Project Tungsten with the goal of optimizing Spark CPU and memory usage.
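
One place you can see Tungsten at work in Spark 2.x is the physical query plan: operators fused by whole-stage code generation, one of Tungsten’s optimizations, are prefixed with an asterisk in the explain() output. A minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tungsten-demo").getOrCreate()

# A simple aggregation over ten million generated rows
df = spark.range(10 ** 7).selectExpr("id % 10 AS key").groupBy("key").count()

# In Spark 2.x, operators compiled by whole-stage code generation
# (a Tungsten optimization) appear with a * prefix in the plan
df.explain()
```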

Impact of Project Tungsten slide from Matei Zaharia keynote

Security

A big hurdle to enterprise adoption and production use is security. Spark security is about to get a big upgrade with the launch of the Berkeley RISELab.

As I mentioned earlier, the goal of the RISELab is to enable “Real-time decisions on live data with strong security.” The RISELab has two projects to address the “strong security” component:

  • Secure Real-time Decisions Stack (SRDS): open source platform to develop RISE apps, secure from the ground up
  • Opaque: secure distributed data analytics framework

Opaque: A Data Analytics Platform with Strong Security. Talk by Wenting Zheng

What’s next?

As Spark makes the move to Primetime, crossing the chasm to the “early majority” of true enterprise adoption, will you take advantage of all that Spark can do for you?

To learn more about how Spark can help you, check out all the presentation videos and slides from Spark Summit East 2017.