IBM DataWorks – Looking for (and Finding) the Next Big Data Thing

By Charles King, Pund-IT, Inc.  September 28, 2016

How vendors support open source technologies doesn’t follow a set path. Many start-ups and smaller companies (along with once-small, now-prosperous businesses) commit almost or entirely to open source for both philosophical and financial reasons. Many leverage Linux and other specific platforms at their developers’ or customers’ behest, while others attempt to bend the technology to their own will or competitive advantage by creating proprietary forks.

That’s a critical issue because open source development naturally evolves along with, and very often leads, innovative industry trends. For example, the big data technologies and initiatives roiling IT over the past half-decade have been driven by open source Apache Software Foundation projects, including Hadoop, MapReduce and Spark. In fact, vendors that ignore or fail to support these projects find themselves quickly falling behind the curve technologically and competitively.

IBM has pursued a singular path in its own open source efforts. In late 2000, the company pledged $1B to develop Linux solutions for its z Systems mainframes and other server platforms. That was followed by substantial investments in a range of platform and community-building projects, along with the open sourcing of technologies the company developed in-house, with Eclipse and the POWER chip architecture among the most notable.

The company also proactively invested in Apache Spark and other complementary advanced analytics and big data projects. What does any of this have to do with anything? IBM’s newly announced Project DataWorks qualifies as both the culmination of and a natural next step for the company’s open source big data strategies and goals. As such, it’s worth close consideration.

What is Project DataWorks?

The company calls Project DataWorks “an initiative from (IBM) Watson that is the industry’s first cloud-based data and analytics platform to integrate all types of data and enable AI-powered decision making.” But what does that mean exactly, and why should businesses care?

IBM notes that organizations recognize the competitive advantages and other benefits they can achieve by effectively leveraging their data assets, but two things stand in their way. First, business data is huge and hugely complex, spanning numerous media, documents, structured and unstructured formats and business processes. Second, data is never static and grows at an astounding rate, roughly doubling every two years.

To paraphrase Rust-Oleum’s famous advertising tag line – Data never sleeps. Nor do the commercial organizations that hope to reap data-related benefits. To do so, they must regularly ingest massive amounts of new information and iterate (often manually) their data models and products. Otherwise advanced analytics efforts risk being inaccurate and irrelevant.

Additionally, most big data efforts tend to focus on and be shaped by data scientists and other highly skilled professionals. Their efforts can be highly valuable to their organizations, but too often they depend on specialized tools and services that become siloed and inaccessible to other interested parties. Those same solutions can also be hard to manage and integrate with other processes, subtracting from rather than adding to the business.

IBM believes DataWorks can effectively address these and other problems.

How Project DataWorks works

How does the company achieve that? In essence, Project DataWorks is designed to ingest and integrate all types of data, make it easy to collect, organize, govern and secure information for projects, and enable enhanced decision making with IBM’s cognitive-assisted artificial intelligence (AI) tools and services.

Project DataWorks is available on Bluemix, IBM’s cloud platform, meaning it can be accessed from virtually anywhere and easily configured to support most business processes, strategies or use cases. The new solution also leverages a number of innovative open source technologies, including Apache Spark, as well as IBM Watson Analytics and IBM Data Science Experience.

Interestingly, IBM noted that Project DataWorks was designed with the same approach used by The Weather Company, whose digital assets IBM agreed to acquire in October 2015. That approach includes a flexible data architecture, rapid ingestion of multiple data sources and Internet-scale data processing and analytics, complemented by IBM’s Watson cognitive technologies and services.

As a result, Project DataWorks can ingest data at rates of fifty to hundreds of Gbps (which IBM says is faster than any other commercial platform). Plus, it can utilize data from all endpoints, including enterprise databases, weather sources, social media sites and Internet of Things (IoT) sensors. The company noted that customers can also leverage an open ecosystem of 20+ partners and technologies such as Alation, Confluent, Continuum Analytics, Galvanize, Skymind, NumFOCUS and RStudio.

Final analysis

What does all this mean in a larger context, especially as it pertains to IBM’s open source history and strategy? More so than most other major vendors, IBM recognized that open source technologies and communities represent a fundamental shift in the way that individuals, groups and organizations create, utilize and benefit from technology. The shift to open, collaborative development portended significant innovations and progress, the results of which are apparent in numerous open source projects.

However, IBM also understands the value of its own technologies, and how the resulting solutions and services can complement the efforts of its open source partners and communities. Project DataWorks is the result of just that sort of innovative, collaborative engagement, and IBM customers will be the ultimate beneficiaries.

© 2016 Pund-IT, Inc. All rights reserved.