In this paper, we take a very unusual approach to the design of a deep neural network accelerator in hardware: for us, the nodes in the neural network are Boolean lookup tables.
We were motivated initially by the fact that in very low precision FPGA neural network architectures, lookup tables are often used for arithmetic, but they are often used for very specific functions: while a -LUT is capable of implementing any nonlinear Boolean function with inputs, it ends up getting used for only a tiny fraction of these functions. A good example is binarised neural networks (BNNs) such as FINN, where LUTs end up being used to implement XNOR gates (multiplication over ) and popcount functions. Our research question is therefore: rather than restricting ourselves to these functions, can we make better use of the LUTs by embracing the nonlinearity and the -input support they give us?
We show that this is indeed possible. Our basic approach is to start with a weight-binarised neural network, add inputs to each node to bring them up to support, and then retrain the Boolean function implemented by that node. Retraining Boolean functions is a bit tricky, of course, because neural network training algorithms are not designed for this purpose. We generate a smooth interpolating function over the LUT entries, allowing us to use standard neural network training software (we use TensorFlow).
The end result is that the re-trained neural network is far more prunable than the original, because the extra inputs to the -LUTs compensate for the removal of other nodes. Thus we end up with a much sparser neural network for the same classification accuracy. The sparsity improves our area by a factor of two or more, yet the more complex inference functions at each node are effectively provided “for free” by the FPGA architecture.
I very much enjoyed the format of the meeting: select interesting speakers and allow them 30 minutes to talk about a topic of their choosing related to the theme of the meeting, with 10 full minutes for discussion and in-depth questions after each talk. Posters were also presented by a wide variety of researchers, with each poster presenter given a one-minute lightning-talk slot. Two of my PhD students, Erwei Wang and He Li, took this opportunity. Erwei presented a preview of our LUTNet paper appearing at FCCM very soon (separate blog post to follow), while He presented some of our work on arbitrary precision iterative compute.
Talks by others included:
David Keyes (KAUST) on the topic “Hierarchical algorithms on hierarchical architectures”. He discussed some very interesting hierarchical low-rank decompositions and also hierarchies of numerical precision.
Kathy Yelick (Berkeley) spoke on “Antisocial parallelism: avoiding, hiding and managing communication”, a very fruitful area of research in recent years. A few years ago, Abid Rafique, one of my former PhD students (joint with Nachiket Kapre) made use of this work, and it was good to catch up with the current state of research.
Anna Scaife (Manchester) gave a fascinating insight into the Square Kilometre Array. The sheer volumes of data are mind boggling (zettabytes annually) and pose unique algorithmic challenges.
Michela Taufer (UTK) discussed molecular dynamics workflows, and how we may be able to harness machine learning to reduce the human bottlenecks in such workflows in the future.
Rick Stevens (Argonne) gave a very engaging talk about the intersection of machine learning with computational science, exemplified by the Candle project, using deep learning in cancer research. He mentioned many of the emerging architectures for deep learning and their optimisation for low-precision compute. Interested readers may also like our recent survey article on the topic.
John Shalf (LBNL) spoke about alternative computational models beyond CMOS, new materials for switches, and the growth of hardware specialisation. He proposed the strategy of: hardware-driven algorithm design, algorithm-driven hardware design, and co-design of hardware and algorithm. Having worked in the FPGA community for decades where this has been our mantra, it is great to see hardware specialisation spreading the message of co-design into the HPC community.
Doug Kothe (ORNL) provided a very interesting insight into exascale computational motifs at the DOE.
Tony Hey (STFC) set out a compelling argument that academics in applied deep learning should focus on deep learning for scientific data, on the basis that (i) scientific data sets are huge and open and (ii) head-to-head competition with industrial giants on profit-oriented deep-learning applications, without access to their data sets, is a poor choice for academia. I think the same argument could be made for academic computer architecture too. His team are developing benchmarks for scientific machine learning, complementary to MLPerf.
Erin Carson (Charles University) presented an enchanting talk on iterative linear algebra in low and multiple precision compute. I’m a fan of her earlier work with Nick, and it was great to hear her current thinking and discussion of least-squares iterative refinement.
Steve Furber (Manchester) spoke about arithmetic in the context of the SpiNNaker machine, and a particular approach they have taken to numerical solution of neural ODEs in fixed-point arithmetic, demonstrating that stochastic rounding can radically improve the quality of their results.
Tim Palmer (Oxford) argued for low precision compute in weather and climate models, allowing the recouped computational cost to be recycled into better resolution weather models, resulting in higher overall accuracy. This reminded me of the argument I made with my PhD student Antonio Roldão Lopes and collaborator Eric Kerrigan in our paper More FLOPS or More Precision?
Guillaume Aupy (INRIA) discussed memory-efficient approaches for automatic differentiation and back-propagation.
Satoshi Matsuoka (RIKEN Centre) took us through the work being done on Post-K, a new Japanese supercomputer being designed to provide compute infrastructure for future workloads at the intersection of big data and AI/ML.
Mike Heroux (Sandia) spoke about his work developing programming infrastructure for future HPC, in particular for performance portability and for system reliability.
My own talk was entitled “Rethinking Deep Learning: Architectures and Algorithms” – I will save summarising the content for a future blog post. Slides for all these talks will appear on the Manchester Numerical Linear Algebra group website. In addition, each speaker has received an invitation to author an article for a special issue of Philosophical Transactions A – this should be a very interesting read.
I was impressed by the great attendance at the meeting and by the quality of the technical interaction; I met several new and interesting people at the intersection of numerical analysis and scientific computing.