IBM’s Data Science Experience: A Lesson in Open Source Community Building

By Charles King, Pund-IT, Inc.  June 8, 2016

IBM’s launch of its new Data Science Experience, an IBM Cloud-based development environment for real-time, high-performance analytics leveraging the Apache Spark framework, can be interpreted on a number of levels.

  • Strategically, it spotlights the company’s ongoing commitment to developing Apache Spark analytics solutions and services.
  • Tactically, coming a year after IBM’s announcement of $300 million in planned Spark investments, it demonstrates the continuing progress of that commitment.
  • Financially, the effort’s impact on the proliferation of data scientists could enhance IBM’s efforts to integrate Spark into its own commercial portfolio.
  • Technologically, the Data Science Experience should help the company maintain its high profile as a forward-thinking leader in Apache Spark, advanced analytics and cognitive computing.

But more broadly speaking, the Data Science Experience also underscores IBM’s practical experience in building and nurturing open source communities. That’s a subject worth closer consideration.

Opening minds on open source

The great-grandparent of all IBM open source efforts was the company’s investment of $100 million, announced in 1999, in Linux development and the related integration of the fledgling operating system into all IBM server platforms, beginning with the venerable zSeries (now z Systems) mainframes.

That was considered controversial, even radical, within the company, to the point that some IBM board members reportedly tried to sink the effort and have its proponents fired. But then-CEO Sam Palmisano blessed the plan and the company pushed ahead, becoming the first Tier 1 IT vendor to publicly support Linux, a position that delivered substantial immediate and subsequent benefits.

Over the following years, IBM endorsed and supported other similar efforts, including open sourcing some of its own core technologies. Those ranged from Eclipse, an integrated development environment (IDE) originally built for Java development, to IBM’s POWER microprocessor architecture to (most recently) SystemML, which was open sourced in 2015 to advance Spark’s machine learning capabilities.

It isn’t an overstatement to say that IBM’s Linux and other open source efforts helped the company bolster its industry reputation and expand its market opportunities. That didn’t occur solely through commercial benefits, such as sales and other engagements.

By being one of the few traditional IT vendors that “get” the true value of open source, IBM has developed valuable partnerships with innovative start-ups, some of which were later acquired. In turn, executives from those firms have become next-generation leaders within IBM, highlighting the long-term cultural benefits that open source has provided.

Nurturing the Spark community

IBM certainly isn’t alone in its enthusiasm for the developers, engineers and scientists who inhabit open source communities. Other software-centric vendors, including Amazon, Apple, Facebook, Google, Microsoft, Oracle, SAP and many others, invest heavily in similar outreach efforts.

But creating tools and applications for emerging analytics platforms, such as Apache Spark, presents significant challenges for those constituencies. Why so? First, because successful efforts require access to sophisticated hardware and sizable data assets. In addition, the nascent state of the market for these solutions makes it difficult to find or justify investing in those assets.

IBM’s new Data Science Experience is designed to obviate those issues. By providing access to an open and collaborative Apache Spark environment and 250 curated datasets, the company aims to inspire and benefit data engineers, data scientists and application developers.

The Data Science Experience will help users ease and speed data ingestion, curation and analysis processes by providing access to content, data, models and open source resources from IBM and others, including H2O, RStudio and Jupyter Notebooks on Apache Spark, in a single, secure, managed environment. That, in turn, should enhance development and education efforts.

The effort underscores IBM’s continuing collaboration with data science organizations, including Galvanize, Lightbend and RStudio. But the Data Science Experience also reflects the company’s work on related projects, including Apache Toree, Apache Quarks, Apache Mesos, Apache Tachyon (now called Alluxio) and EclairJS, as well as the more than 3,000 contributions IBM has made during the past year to Apache Spark sub-projects: Spark SQL, SparkR, MLlib and PySpark.

Final analysis

One could successfully argue that the Data Science Experience is essential for Apache Spark. But given IBM’s years of open source investment and success, this new effort is also entirely expected—part and parcel of the commitment to open source long embedded in the company’s DNA. That doesn’t detract from the value the Data Science Experience will provide data engineers, data scientists and application developers focused on Apache Spark.

But it does offer insights into the value that IBM regularly provides and realizes from its open source contributions. Some of those benefits are clearly commercial, allowing the company to develop and deliver new innovations at a faster pace than other vendors. But some are less tangible, like the enthusiasm of new leaders and employees whose talents will influence and enhance IBM and the communities to which it belongs for years to come.

© 2016 Pund-IT, Inc. All rights reserved.