By Charles King, Pund-IT, Inc. June 17, 2015
IBM announced a major commitment to Apache Spark, which the company described as potentially the most important new open source project in a decade that is being defined by data. According to IBM, as data and analytics are embedded into the fabric of business and society – from popular apps to the Internet of Things (IoT) – Spark brings essential advances to large-scale data processing:
- First, it dramatically improves the performance of data-dependent applications.
- Second, it radically simplifies the process of developing intelligent apps that are fueled by data.
To further accelerate innovation for the Spark ecosystem, IBM plans to:
- Build Spark into the core of the company’s analytics and commerce platforms.
- Leverage Spark as a key underpinning for its Watson Health Cloud platform, helping to deliver faster time to insight/value for medical providers and researchers leveraging analytics around health data.
- Open source the IBM SystemML machine learning technology and collaborate with Databricks to advance Spark’s machine learning capabilities.
- Offer Spark as a Cloud service on IBM Bluemix to make it possible for developers to quickly load data, model it and derive predictive artifacts.
- Commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen IBM labs worldwide, and continue to build out a Spark Technology Center in San Francisco to aid the data science and developer communities and foster design-led innovation.
- Educate more than 1 million data scientists and data engineers on Spark via partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University.
IBM’s plans for Apache Spark highlight the company’s continuing support for complementary, innovative open source technologies.
Attending this week’s IBM Apache Spark analyst session and the following Spark community event at Galvanize (a San Francisco facility for educating entrepreneurs and technologists) felt like old home week to me. After starting work as an IT industry analyst in the late 1990s, one of the first vendor events I attended focused on IBM’s decision to become the first Tier 1 vendor to throw its considerable financial weight and technological acumen behind Linux and related open source projects.
Why the company would do such a thing frankly baffled and even frightened people inside and outside IBM. In fact, Robert LeBlanc (currently the SVP of IBM Cloud), who chaired the group that looked into Linux and recommended that the company support the nascent OS, was reportedly targeted for dismissal by affronted members of IBM’s board of directors. Fortunately, then-CEO Lou Gerstner ignored them and Linux became the first of many open source efforts IBM has supported.
Those board members and others couldn’t imagine what a vendor as thoroughly steeped in traditional enterprise computing as IBM could possibly gain from a technology and community whose efforts flouted and threatened IT convention. Their fears seem quaint at best today, given the critical position that Linux occupies in IT solutions and data centers of nearly every kind, and the singular role it and other open source technologies have played in preserving and furthering numerous IBM platforms and initiatives.
Casting a light on Spark
From the tenor of IBM’s announcement and the related analyst sessions in San Francisco, it seems reasonable to assume that the company holds a similar or even greater optimism for Apache Spark, especially given the size of its commitment. But why is that the case? IBM’s delineation of Spark’s benefits – dramatically improved performance and simplified development for data-dependent applications – is fundamentally correct but more intriguing than inspiring. So what gives? What’s sparking (pun intended) IBM’s intense interest?
In essence, Apache Spark is a cluster computing framework that complements Hadoop but provides radically faster (up to 100X in some applications) performance than the traditional two-stage, disk-based data processing model supported by Hadoop’s MapReduce. Due to those performance characteristics, Spark can be used for scenarios like fast interactive queries and real-time stream data processing that have mostly eluded Hadoop-based solutions. That means that Spark has particular affinities for a range of existing and prospective IBM analytics and big data solutions.
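To make the contrast concrete, here is a toy Python sketch. It is not Spark or Hadoop code, and every name in it is hypothetical. It assumes a small two-stage job (count words, then keep the frequent ones) and runs it two ways: a disk-based style that writes the intermediate result to a file between stages, and an in-memory style that simply hands the result to the next stage.

```python
import json
import os
import tempfile
from collections import Counter

# Illustrative input only; a real job would read from distributed storage.
LINES = [
    "spark complements hadoop",
    "spark keeps data in memory",
    "hadoop writes to disk",
]


def count_words(lines):
    """Stage 1: count how often each word appears."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


def frequent_words(counts, minimum):
    """Stage 2: keep the words seen at least `minimum` times."""
    return sorted(word for word, n in counts.items() if n >= minimum)


def disk_based_job(lines, minimum=2):
    """Write stage 1 output to a file, then reload it for stage 2."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(count_words(lines), handle)
        with open(path) as handle:
            counts = json.load(handle)
        return frequent_words(counts, minimum)
    finally:
        os.remove(path)


def in_memory_job(lines, minimum=2):
    """Keep stage 1 output in memory and pass it straight to stage 2."""
    counts = count_words(lines)
    return frequent_words(counts, minimum)


if __name__ == "__main__":
    print(disk_based_job(LINES))
    print(in_memory_job(LINES))
```

Both functions return the same answer; the difference is the round trip through storage between stages. In this toy that cost is negligible, but across many stages and large data sets it is the kind of overhead that in-memory processing is meant to avoid.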
Apache Spark is also highly flexible in terms of development environments, scalability and the applications for which it can be used. Spark supports Java, Python and Scala APIs, and scales up to 8,000 nodes in production. Along with conventional Hadoop use cases and scenarios, Spark has a special affinity for machine learning workloads, underscoring IBM’s decision to open source its SystemML machine learning technology and closely collaborate with Databricks (a start-up founded by the folks who created Spark at UC Berkeley’s AMPLab).
Though Spark requires a cluster manager and a distributed storage system, it is anything but limited in these regards. Spark supports Apache Mesos or Hadoop YARN for cluster management and, in terms of distributed storage, can interface with many solutions, including Amazon S3, Cassandra, Hadoop Distributed File System (HDFS) and OpenStack Swift. IBM’s decision to support Spark in its Bluemix developer cloud should complement these capabilities.
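As a rough illustration of how those choices surface in practice, the Python sketch below assembles a `spark-submit` command line for a chosen cluster manager and storage system. It only builds the command; it does not run Spark. The host names, port numbers, bucket name, the `analyze.py` script and the helper function are all placeholders invented for this example, and the exact master and storage URL forms vary by Spark and Hadoop version, so treat them as assumptions to check against the documentation. Cassandra is left out because it is typically reached through a connector library rather than a storage URI.

```python
# Hypothetical master URLs for the two cluster managers named above.
MASTER_URLS = {
    "mesos": "mesos://mesos-master.example.com:5050",
    "yarn": "yarn",
}

# Hypothetical input locations on two of the storage systems named above.
INPUT_URIS = {
    "hdfs": "hdfs://namenode.example.com:8020/data/events",
    "s3": "s3a://example-bucket/data/events",
}


def build_submit_command(cluster, storage, script="analyze.py"):
    """Return the argument list for a spark-submit invocation."""
    return [
        "spark-submit",
        "--master", MASTER_URLS[cluster],
        script,
        INPUT_URIS[storage],
    ]


if __name__ == "__main__":
    # The same script is pointed at different infrastructure
    # by changing only the master URL and the input URI.
    print(" ".join(build_submit_command("yarn", "hdfs")))
    print(" ".join(build_submit_command("mesos", "s3")))
```

The point of the sketch is that the application script stays the same while the cluster manager and storage back end are swapped by configuration.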
Fanning Spark into a flame
Spark’s ability to leverage numerous open source tools and solutions, in-memory technologies and commodity hardware means that it can support highly cost-effective solutions for a wide variety of big data problems, thus extending the applications and workloads to which Hadoop can be applied.
What does this mean in the real world of enterprise data centers? IBM considers Spark one of the most significant open source projects of the past decade, and believes it can expand or extend virtually all of the company’s analytics and big data business offerings. That could be hugely impactful for IBM, along with its existing and future customers.
But the company also believes that view only scratches the surface of Spark’s potential. Beth Smith, GM of IBM’s Analytics Platform division, noted human genomic sequencing as an example. When it first became technologically possible to sequence a single human genome, the process cost $1M. That has since dropped to about $5K, still well above the $1K threshold where genomic sequencing could be used commercially.
However, Smith noted that one Spark workgroup is aiming to develop a solution that will sequence an individual human genome for a mere $50.00, a goal that would unleash an enormous number of new healthcare business innovations and patient benefits. That Smith described Apache Spark as “the analytics operating system” signals the depth of IBM’s feelings about the technology and its potential.
Shades of open source past
This week’s Spark announcement and events were heady stuff, and seemed to capture the energy and imagination of the mostly younger tech crowd in San Francisco. But to paraphrase author Gertrude Stein’s line about San Francisco’s cross-bay neighbor Oakland (and home to the new NBA champion Golden State Warriors), “Is there any ‘there’ there” when it comes to Apache Spark? In short, absolutely. Especially in terms of potential technological innovation, Spark is fully capable of becoming a key enabler of a wide range of analytics projects and efforts. If that comes to pass, Apache Spark could become as instrumental to big data as Linux has been to cloud computing.
But it is also extremely early in terms of Spark development, enough so that IBM’s monetary and technological investments are analogous to its Linux activities in the late 1990s. The company isn’t alone in its Spark efforts – virtually every major IT vendor is supporting Spark directly or via big data-focused specialists like Cloudera. However, from what I saw in San Francisco this week, the size and scope of IBM’s commitment to Spark appear to be considerably larger than those of other vendors.
That said, the challenges of early market adoption remain, especially in terms of developing the necessary skills and helping IT customers and markets understand Spark’s full potential. That means that nurturing Apache Spark will require substantial time, patience and investment. Those qualities are as rare in IT as they are in any other industry but, given its past open source successes, such an effort seems entirely within IBM’s grasp.
© 2015 Pund-IT, Inc. All rights reserved.