By David Hill, Mesabi Group February 17, 2016
Please note: This guest commentary is by independent IT industry analyst David G. Hill, principal of the Mesabi Group.
Apache Spark is an open-source cluster computing framework for processing big data across large pools of compute and storage resources. But organizations managing large clusters for multiple projects are finding themselves faced with challenges, such as performance and cost, that Spark by itself was not designed to handle. That is why IBM has introduced IBM Platform Conductor for Spark, a software offering whose capabilities help enterprises meet those challenges.
But before we get into IBM Platform Conductor for Spark, we need some context, including understanding that the new data-driven world where analytics shines is different from the older application driven IT world. Apache Spark is one of the new technologies that can make the data-driven world work efficiently and effectively, which is why IBM has made a major commitment to it.
The data-driven world has a non-traditional HW/SW infrastructure
Traditionally, software intelligence was application-driven: data was created to serve the purposes of the application (think online transaction processing systems). While that type of software intelligence will continue to be essential, we are now moving to an era of software intelligence that is data-driven, extracting meaning from data and thereby improving business decision making through analytics and business intelligence. The data-driven era encompasses big data in a very large and comprehensive sense, but it also includes data warehousing, Web search engines, sensor-based analysis as part of the Internet of Things, and cognitive analytics solutions, such as IBM Watson.
However, the data-driven world is not only about the software intelligence, but also encompasses IT infrastructures that support data-driven projects, very notably, big data projects. For example, clusters of storage-rich servers (combining both compute and storage as commodity resources) are likely to be used rather than having servers access shared storage in the form of a storage area network. New software technologies, such as Hadoop and MapReduce, have emerged as critical resources in such projects.
The rise of Spark
Hadoop delivers robust, distributed processing of massive amounts of unstructured data across those clusters of storage-rich servers. MapReduce parcels out work (from queries) to the various nodes within the cluster. It organizes and reduces the results of that work on each of the nodes to provide answers to the queries.
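The division of labor MapReduce performs can be pictured with a small word-count sketch in plain Python. This is only an illustration of the map and reduce phases, not the Hadoop API; the chunk contents and function names are invented for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one data chunk."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Each "node" in the cluster maps its own chunk of the data in parallel...
chunks = ["Spark builds on MapReduce", "MapReduce parcels out work"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
# ...then the shuffled pairs are reduced into a single answer to the query.
result = reduce_phase(mapped)
print(result["mapreduce"])  # → 2 (the word appears in both chunks)
```

In a real Hadoop cluster the map and reduce functions run on separate nodes, with a shuffle stage moving the intermediate pairs between them; the sketch collapses all of that onto one machine to show the idea.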
However, MapReduce’s inherent limitations are among the reasons driving development of Spark, which is much more comprehensive and powerful than MapReduce. For example, Spark works in memory where possible, whereas MapReduce writes intermediate results to traditional hard disk storage. When in-memory techniques can be employed, Spark is said to be two orders of magnitude (100x) faster than MapReduce. Even in cases where Spark utilizes hard disk storage, it is said to deliver an order of magnitude (10x) improvement over MapReduce.
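Much of that gap comes from avoiding recomputation and disk round-trips between stages. The toy Python sketch below illustrates the principle only; it is not Spark’s API (the real equivalent is persisting an RDD in memory, e.g. with rdd.cache()), and the counter and function names are invented.

```python
calls = {"n": 0}  # counts how often the "expensive" work actually runs

def expensive_transform(record):
    """Stand-in for a costly, disk-bound computation."""
    calls["n"] += 1
    return record * 2

data = list(range(100))

# Without caching, each action recomputes the transform from scratch,
# much as MapReduce re-reads intermediate results from disk between stages.
total = sum(expensive_transform(x) for x in data)
biggest = max(expensive_transform(x) for x in data)
recomputed = calls["n"]  # 200 invocations for two passes over the data

# Holding the results in memory (the idea behind Spark caching an RDD)
# computes each record only once, however many actions follow.
calls["n"] = 0
cached = [expensive_transform(x) for x in data]
total2, biggest2 = sum(cached), max(cached)
computed_once = calls["n"]  # 100 invocations for the same two answers
```

Both approaches produce identical answers; the cached version simply does half the work here, and the savings grow with every additional pass over the data.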
IBM’s major commitment to Spark
How important is Spark? IBM believes it is “potentially the most significant open source project of the next decade.” The company has made a major commitment to Spark, including plans to embed Spark into its analytics and commerce platforms, in addition to offering Spark as a service on IBM Cloud. Moreover, the company committed to putting more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs worldwide. All in all, IBM expects to invest $300 million in Spark-related activities over the next few years.
So what is IBM doing to help organizations deploying big data converged compute and storage infrastructures to best take advantage of Spark technology? One answer is IBM Platform Conductor for Spark.
Enter IBM Platform Conductor for Spark
If organizations were running only one or a few servers with dedicated storage, managing big data projects might be fairly straightforward. But many organizations are rapidly expanding the number of their big data projects, often with each project running on a separate, dedicated cluster. If clusters are unable to share their compute and storage resources, they effectively become silos that run the risk of low resource utilization, where at least some compute resources sit idle much of the time.
Creating a shared infrastructure that makes many computers look like one is the role of IBM’s Platform Conductor for Spark offering. One of the benefits of a shared infrastructure is that it cuts the number of server cores required to handle peak capacity demands. That reduces costs by requiring fewer resources, while better workload balancing and resource management deliver results faster. It’s a win-win, with users getting the insights and benefits from their analytics efforts sooner and management enjoying better cost efficiencies. How exactly does IBM’s Platform Conductor for Spark achieve this?
- Workload balancing and resource management — IBM’s offering combines Spark itself with the company’s workload and resource management software, along with data management software from IBM’s Spectrum Scale product. IBM Platform Conductor for Spark facilitates a multi-tenant environment, meaning that different projects take advantage of physical resources simultaneously and dynamically, while users and their data remain logically isolated. IBM Platform Conductor for Spark provides quality-of-service (QoS) capabilities through a resource management policy engine. In addition, workload balancing and resource management functionalities aggregate all converged compute and storage resources into one logical pool and provide the access capabilities to tap into that pool and run workloads.
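IBM has not published the internals of its policy engine, but a resource management policy engine of this general kind can be pictured as a weighted fair-share allocator over a single pool of cores. The sketch below is a generic weighted max-min fair allocation in Python, not IBM’s implementation; the tenant names, demands, and share weights are invented.

```python
def allocate(total_cores, demands, shares):
    """Weighted max-min fair allocation: repeatedly grant one core to the
    active tenant furthest below its fair share, capped at its demand."""
    alloc = {t: 0 for t in demands}
    for _ in range(total_cores):
        active = [t for t in demands if alloc[t] < demands[t]]
        if not active:
            break  # every tenant's demand is already satisfied
        # Lowest allocation relative to configured share goes next;
        # the tenant name breaks ties deterministically.
        nxt = min(active, key=lambda t: (alloc[t] / shares[t], t))
        alloc[nxt] += 1
    return alloc

# Hypothetical tenants sharing one 16-core pool.
demands = {"etl": 10, "adhoc": 4, "ml": 20}   # cores each project wants
shares = {"etl": 2, "adhoc": 1, "ml": 1}      # configured policy weights
print(allocate(16, demands, shares))  # → {'etl': 8, 'adhoc': 4, 'ml': 4}
```

Note that the small tenant gets everything it asked for, while the remaining cores are split between the larger tenants in proportion to their weights — the kind of outcome a QoS policy engine is there to guarantee.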
- Data management — An IBM Platform Conductor for Spark environment allows organizations to continue to use existing tools, such as HDFS (Hadoop Distributed File System), as much as they like. However, with the inclusion of IBM’s Spectrum Scale FPO technology as part of the offering, data management capabilities are greatly enhanced. Spectrum Scale has its own file system, which is POSIX-compliant. All nodes see all file data and any node in the cluster can concurrently read or update a common set of files. This means that applications can scale out easily. That scale can extend to petabytes of data and billions of records. In addition, Spectrum Scale works across tiers of storage, and that includes not only SSDs and disks, but also tape, where large archives of infrequently accessed data can be stored cost efficiently. Moreover, Spectrum Scale provides a wide range of other capabilities, including those associated with high availability.
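The tiering decision described above can be pictured, in spirit only, as a placement rule keyed on access patterns. The function name and thresholds below are invented for illustration and do not reflect Spectrum Scale’s actual policy language.

```python
def place_tier(days_since_access, accesses_per_month):
    """Toy placement rule in the spirit of a storage-tiering policy:
    hot data on SSD, warm data on disk, cold archives on tape.
    All thresholds here are invented for the example."""
    if days_since_access <= 7 and accesses_per_month >= 10:
        return "ssd"   # hot: frequently and recently accessed
    if days_since_access <= 90:
        return "disk"  # warm: touched within the last quarter
    return "tape"      # cold: cheap, high-capacity archive

print(place_tier(1, 50))    # → ssd  (hot working set)
print(place_tier(30, 2))    # → disk (warm data)
print(place_tier(365, 0))   # → tape (rarely accessed archive)
```

In a real deployment such rules run continuously in the background, migrating files between tiers so that infrequently accessed data drifts toward the cheapest storage without applications needing to change their file paths.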
In the world of big data infrastructures where Spark is taking an ever more prominent and dominant role, IBM is positioning its Platform Conductor for Spark offering as a valuable management platform. Though Spark is a powerful solution, it was not intended to provide all the extra workload balancing and resource management and data management capabilities that IBM’s offering provides. Organizations could conceivably roll their own by collecting and integrating various pieces of software, but IBM’s unified Platform Conductor for Spark offering makes life easier for organizations and speeds up getting the benefits in a Spark-enabled data-driven infrastructure world.
© 2016 Mesabi Group. All rights reserved.
About the Mesabi Group
The Mesabi Group (www.mesabigroup.com) helps organizations make their complex storage, storage management, and interrelated IT infrastructure decisions easier by making the choices simpler and clearer to understand.