IBM Research Blows Away Deep Learning Records

By Charles King, Pund-IT, Inc.

Popular references to artificial intelligence (AI) and other forms of computing that mimic human intelligence are so commonplace that it is easy to forget they depend on highly complex, highly robust fundamental technologies that continue to evolve.

That point underscores the breakthrough Distributed Deep Learning (DDL) software announced this week by IBM Research, and the remarkable results the company has achieved. Saying IBM’s efforts blow away what people can expect from and achieve with Deep Learning and related projects might seem like an overstatement, but this is a case where undue modesty could lead to basic misunderstanding.

Let’s look at what IBM Research delivered and how it changes the Deep Learning landscape.

Deep Learning: The elephant (and blind men) in the room

First, remember that, like every other form of computational innovation, AI evolution depends on system and software advancements. AI concepts initially arose among computing professionals in the 1950s. But they gained traction in popular science fiction stories and films far more quickly than they did in real-world computation. Isaac Asimov’s I, Robot, HAL in 2001: A Space Odyssey, R2-D2 and C-3PO in Star Wars, and the cyborg assassins in The Terminator films all set a bar for what might be achieved by blending human and machine characteristics.

Unsurprisingly, the real world moves more slowly and messily than science fiction, especially when it comes to computationally-intensive projects. Just because you’d like to do or pursue something doesn’t justify costly means to achieve middling ends.

Early examples of AI relied on cumbersome, hand-written software routines that enabled systems to perform simple, narrow, human-like tasks. When machine learning (ML) emerged it offered AI proponents a major jump forward by training systems with algorithms and large data sets to analyze and learn about information, then determine next steps or form predictions about potential outcomes.

Deep Learning (DL) complements and has some similarities to ML (making them easy to confuse) but is also uniquely separate. In short, rather than relying on the algorithms that are central to ML, DL leverages “neural networks” (computational processes and networks that mimic functions of the human brain) and huge data sets to “train” systems for AI.

Not surprisingly, DL training processes are similar to those people use to learn new languages, mathematics and other complex skills. For example, language students utilize a variety of tools, including group recitation, spaced repetition of verb/noun forms, rote memorization, flash cards, games and reading/writing to steadily grow their understanding and capabilities.

Similarly, DL uses automated processes to train AI systems. Plus, GPU-based neural networks enable systems to ingest multiple threads of information and share information across the system, enabling them to “multi-task” in their training far better than do most human students. That allows AI systems to be trained for areas like enhanced speech and visual recognition, medical image analysis and improved fraud detection.

Key advances in graphics processing unit (GPU) technologies have drastically reduced the cost of neural networks and DL, bringing them to the fore of AI development. However, GPU-based systems used for DL training also suffer some significant, fundamental challenges, including scaling limitations and bottlenecks that hinder ever-faster GPUs from effectively synching and exchanging information.

In a blog post, Dr. Hillary Hunter, an IBM Fellow who led the company’s DDL team effort, compared this issue to the parable of the “Blind Men and the Elephant.” Each blind man describes the elephant according to the individual body part he touches but those partial experiences lead them to misunderstand the creature as a whole.

Hunter notes that given enough time, the group might “share enough information to piece together a pretty accurate collective picture of an elephant.” Similarly, information is shared and synched across multiple GPUs to collectively develop AI capabilities. The problem is that system performance bottlenecks hinder optimal performance and slow progress to a crawl.
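To make the “share and synch” idea concrete, here is a minimal sketch of synchronous data-parallel training in plain Python. It is an illustration only, not IBM’s DDL code: each “worker” stands in for one GPU, and the toy model, data and function names are all hypothetical. Every worker computes a gradient from its own slice of the data, then the gradients are averaged so that all workers apply the same update.

```python
# Illustrative sketch only: each "worker" stands in for one GPU.
# The model (y = weight * x), the data and all names are hypothetical.

def local_gradient(weight, shard):
    """Gradient of mean squared error for y = weight * x on one data shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def synchronized_step(weight, shards, learning_rate=0.01):
    """One synchronous step: local gradients, then average, then update."""
    gradients = [local_gradient(weight, shard) for shard in shards]
    averaged = sum(gradients) / len(gradients)  # the "share and synch" step
    return weight - learning_rate * averaged

# Toy data drawn from y = 3x, split evenly across four workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

weight = 0.0
for _ in range(200):
    weight = synchronized_step(weight, shards)

print(f"learned weight: {weight:.4f}")
```

In a real cluster the averaging step is where the communication happens, so it is the step most exposed to the bottlenecks described above.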

IBM Research’s DDL code takes the field

How does this work in real-world circumstances? While popular DL frameworks, including TensorFlow, Caffe, Torch and Chainer, can efficiently leverage multiple GPUs in a single system, scaling to multiple, clustered servers with GPUs is difficult at best. But that doesn’t mean that DL scaling is impossible.

As an example, in June Facebook’s AI Research (FAIR) team posted record results for best scaling for a cluster with 256 NVIDIA P100 GPUs. The team used a ResNet-50 neural network model with a small ImageNet-1K dataset (with about 1.3 million images) and a large batch size of 8192 images. Using that configuration, FAIR achieved a respectable 89% scaling efficiency for a visual image training run on Caffe.

In addition, the effectiveness of DL training systems used for visual image training is measured according to dataset size, image recognition accuracy and the length of time required to perform a training run. Microsoft was the previous record holder in this instance. Utilizing the very large ImageNet-22k dataset of 7.5 million images, the company’s AI team achieved 29.8% recognition accuracy in a training run that took 10 days to complete.

How does IBM Research’s new Distributed DL achievement compare? The team utilized a cluster of 64 IBM Power System servers with 256 NVIDIA P100 GPUs to perform numerous image recognition training runs on Caffe. For a ResNet-50 model using the same dataset as the Facebook team, the IBM Research team achieved a near-perfect 95% scaling efficiency.
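The article does not spell out how scaling efficiency is calculated. A common definition, assumed here, is the throughput a cluster actually achieves divided by the ideal throughput of N GPUs each running at single-GPU speed. The sketch below uses that assumption to show what the 89% and 95% figures mean on 256 GPUs. The function names and the images-per-second values are hypothetical.

```python
# Sketch under an assumed definition:
#   efficiency = cluster throughput / (single-GPU throughput * GPU count)
# Function names and throughput values are hypothetical.

def scaling_efficiency(throughput_single, throughput_cluster, gpu_count):
    """Fraction of ideal linear speedup that the cluster actually achieves."""
    ideal_throughput = throughput_single * gpu_count
    return throughput_cluster / ideal_throughput

def effective_gpus(efficiency, gpu_count):
    """Number of GPUs' worth of useful work delivered at a given efficiency."""
    return efficiency * gpu_count

GPU_COUNT = 256  # both the FAIR and IBM clusters used 256 P100 GPUs

fair_effective = effective_gpus(0.89, GPU_COUNT)
ibm_effective = effective_gpus(0.95, GPU_COUNT)

print(f"FAIR at 89%: about {fair_effective:.1f} GPUs of useful work")
print(f"IBM at 95%:  about {ibm_effective:.1f} GPUs of useful work")

# Made-up throughputs: 100 images/sec on one GPU, 24,320 images/sec on 256.
example = scaling_efficiency(100.0, 24320.0, GPU_COUNT)
print(f"example efficiency: {example:.2%}")
```

Under this reading, the six-point gap between the two results is worth roughly 15 additional GPUs of useful work on the same hardware count.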

The team also used the Power/NVIDIA system to train a ResNet-101 neural network model similar to the one used by Microsoft’s team with the ImageNet-22k dataset and a batch size of 5120. Despite that far larger and more computationally complex dataset, the IBM cluster, with its low communications overhead, achieved a scaling efficiency of 88%, a smidgen less than FAIR’s best effort with a much smaller, simpler dataset.

Using the same large dataset, IBM also delivered significantly better image recognition accuracy: 33.8% compared to Microsoft’s 29.8%. Most impressively, the IBM team completed its Caffe training run in just 7 hours compared to Microsoft’s previous record-holding 10 days. In other words, IBM’s DDL library offered near-perfect clustered system scaling, supported highly complex datasets and neural networks, and delivered notably better training results in a tiny fraction of the previous world record time.
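For a sense of scale, the arithmetic behind those comparisons can be worked through in a few lines. The sketch below uses only the figures quoted above (10 days versus 7 hours, 29.8% versus 33.8%). The variable names are illustrative, and the two runs used different hardware, so the ratio describes the reported results rather than a controlled benchmark.

```python
# Arithmetic on the figures quoted in the article; names are illustrative.

HOURS_PER_DAY = 24

microsoft_hours = 10 * HOURS_PER_DAY  # previous record: 10 days
ibm_hours = 7                         # IBM's reported run: 7 hours

speedup = microsoft_hours / ibm_hours
time_fraction = ibm_hours / microsoft_hours

microsoft_accuracy = 29.8  # percent
ibm_accuracy = 33.8        # percent

accuracy_gain_points = ibm_accuracy - microsoft_accuracy
relative_gain = accuracy_gain_points / microsoft_accuracy

print(f"training time ratio: about {speedup:.1f}x faster")
print(f"share of previous record time: {time_fraction:.1%}")
print(f"accuracy gain: {accuracy_gain_points:.1f} percentage points")
print(f"relative accuracy improvement: {relative_gain:.1%}")
```

The 7-hour run works out to under 3% of the previous 240-hour record, which is what “a tiny fraction” means in practice.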

Final analysis

To underscore the innovative scalability of IBM’s new Distributed DL solutions, Sumit Gupta, VP of the company’s Machine Learning and AI organization, detailed the team’s efforts using traditional software and a single IBM Power Systems server (a “Minsky” S822LC for HPC) with four NVIDIA P100 GPUs to train a model with the ImageNet-22k dataset using a ResNet-101 neural network.

Gupta noted that the 16 days required for the training run was “longer than any vacation” he’d taken. Joking aside, the results clearly highlight how long training runs and single-system limitations leave data scientists facing tangibly delayed time to insight and limited productivity. In turn, those barriers to progress and payback inevitably dampen many businesses’ enthusiasm for AI and related projects.

IBM is also taking a notably “open” approach to making its DDL software innovations available to the public. The DDL library is available in preview version in the latest V4 release of IBM’s PowerAI software distribution running on its Power Systems solutions. The PowerAI V4 release includes integration of the DDL library and APIs into TensorFlow and Caffe, and IBM Research has developed a prototype integration with impressive results in Torch.

IBM noted its intentions to make the DDL cluster scaling features available to any organization using deep learning for training AI models. Given the remarkable advances in scaling efficiency, training results and overall performance offered by the company’s DDL innovations, it’s safe to say that IBM Research has blown away many of the inherent limitations of deep learning technologies in ways that alter the artificial intelligence market landscape for the better.