The central theme of the paper is that hardware accelerator circuits for neural networks can actually be treated as neural networks. Consider the two graphs below. One of them represents a simple deep neural network where each node performs an inner product and a ReLU operation. The other represents the result of (i) deciding to used 4-bit fixed-point arithmetic, and then (ii) synthesising the resulting network into a circuit, technology-mapped to 2-input Boolean gates.
Although there are obvious differences (in structure, in number of nodes) – there is a core commonality: a computation described as a graph operating on parameterisable functional nodes.
So what can we gain from this perspective?
1. Binarized Neural Networks are universal. The paper proves that any network you want to compute can be computed using binarized neural networks with zero loss in accuracy. It’s simply not the case that some problems need high precision. But, as a corollary, it is necessary to not be tied too closely to the original network topology if you want to be guaranteed not to lose accuracy.
2. Boolean topologies are tricky things. So if we want to rethink topologies, what first principles should we use to do so? In the paper, I suggest looking to inspiration from the theory of metric spaces as one step towards producing networks that generalise well. Topology, node functionality, and input / output encoding interact in subtle, interesting, and under-explored ways.
I’m hugely excited by where this work could go, as well as the breadth of the fields it draws on for inspiration. Please do get in touch if you would like to collaborate on any of the open questions in this paper, or any other topic inspired by this work.
I have been reading Spivak’s book ‘Category Theory for the Sciences’ on-and-off for the last 18 months during vacations as some of my ‘holiday reading.’ I’ve wanted to learn something about category theory for a while, mainly because I feel it’s something that looks fun and different enough to my day-to-day activity not to feel like work. I found Spivak’s book on one of my rather-too-regular bookshop trips a couple of years ago, and since it seemed like a more applied introduction, I thought I’d give it a try. This blog post combines my views on the strengths of this book from the perspective of a mathematically-minded engineer, together with a set of notes on the topic which might be useful for me – and maybe other readers – to refer back to in the future.
The book has a well-defined structure. The first few chapters cover topics that should be familiar to applied mathematicians and engineers: sets, monoids, graphs, orders, etc., but often presented in an unusual way. This sets us up for the middle section of the book which presents basic category theory, showing how the structures previously discussed can be viewed as special cases within a more general framework. The final chapters of the book discuss some topics like monads that can then be explored using the machinery of this framework.
Chapter 3 covers the category of sets.
One of the definitions here that’s less common in engineering is that of coproduct. The coproduct of sets and , denoted is defined as the disjoint union of and . I first came across disjoint union as a concept when doing some earlier holiday reading on topology. We also here get the first experience of the author’s use of informal ‘slogans’ to capture the ideas present in more formal definitions and theorems: “any time behavior is determined by cases, there is a coproduct involved.” I really like this approach – it’s the kind of marginalia I usually write in books myself, but pre-digested by the author!
The notation is used for the set of functions from to , which we are encouraged to refer to as morphisms, all without really explaining why at this stage – it of course all becomes clear later in the book.
To give a flavour of the kind of exercises provided, we are asked to prove an isomorphism . This reminds me of digital circuit construction: the former corresponds to a time-multiplexed circuit capable of producing outputs from with inputs from either or , together with an identifier for which (to feed a multiplexer select line); the latter corresponds to two circuits producing both the outputs rather than just the selected one.
The fibre product, , given functions and is defined as the set . Any set isomorphic to this fibre product is called a pullback of the diagram .
The fibre sum, , given functions and is the quotient of by the equivalence relation generated by and for all . A pushout of and over is any set which is isomorphic to .
A good intuitive description in the text is to think of the fibre sum of and as gluing the two sets together, with as the glue. A nice example from one of the exercises is , , a singleton set, defined by . Then we’re left with , where is an element corresponding to the collapsing of all negative elements in to a single point.
The familiar notions of surjective and injective functions are recast as monomorphisms and epimorphisms, respectively. A function is a monomorphism if for all sets and pairs of functions , if then . A function is an epimorphism if for all sets and pairs of functions , if then . By this time in the book, we get the hint that we’re learning this new language as it will generalise later beyond functions from sets to sets – the chapter ends with various slightly more structured concepts, like set-indexed functions, where the idea is always to define the object and the type of morphism that might make sense for that object.
Monoids, Groups, Graphs, Orders & Databases
In Chapter 4, the reader is introduced to several other structures. With the exception of the databases part, this was familiar material to me, but with slightly unfamiliar presentation.
We learn the classical definition of a monoid and group, and also discuss the free monoid generated by a set as being the monoid consisting of set of lists of elements of , with the empty list as the identity and list concatenation as the operation. We discuss the idea of a monoid action on a set , i.e. a function associated with the monoid having the properties for any , where is the identity, and , where and are any elements of , is any element of , and is the monoid operation. The discussion of monoids is nice and detailed, and I enjoyed the exercises, especially those leading up to one of the author’s slogans: “A finite state machine is an action of a free monoid on a finite set.”
Graphs are dealt with in a slightly unusual way (for me) as quadruples of a set of vertices , a set of arrows , a functions , mapping each arrow into its source and destination vertex. This corresponds then to what I would call a multigraph, as parallel edges can exist, though it is called a graph in the book.
For each of the structures discussed, we finish by looking at their morphisms.
As an example, let and be monoids (the tuple is of the set, identity element, and operation). Then a monoid homomorphism is a function preserving identity and operation, i.e. (i) and for all .
Similarly, for graphs, graph homomorphisms for and are defined as pairs of functions and preserving the endpoints of arrows, i.e. such that the diagrams below commute.
Order morphisms are similarly defined, this time as those preserving order.
Thus we are introduced to several different (somewhat) familiar mathematical structures, and in each case we’re learning about how to describe morphisms on these structures that somehow preserve the structure.
Categories, Functors, and Natural Transformations
Things really heat up in Chapter 5, where we actually begin to cover the meat of category theory. This is exciting and kept me hooked by the pool. The unusual presentation of certain aspects in previous chapters (e.g. the graphs) all made sense during this chapter.
A category is defined as a collection of various things: a set of objects , a set of morphisms for every pair of objects and , a specified identity morphism for every object , and compositions . To qualify as a category, such things must satisfy identity laws and also associativity of composition.
Examples of the category of Monoids, of Groups, of Sets, of Graphs, of Pre-Orders, etc., are explored. Isomorphisms between objects in a category are carefully defined.
Functors are introduced as transformations from one category to another, via a function transforming objects and a function transforming morphisms, such that identities and composition and preserved. Functors are then used to rigorously explore the connections between preorders, graphs, and categories, and I finally understood why the author had focused so much on preorders in the previous chapters!
Things started to take a surprising and exciting turn for me when the author shows, via the use of functors, that a monoid is a category with one object (the set of the monoid becomes the morphisms in the category), and that groups can therefore be understood as categories with one object for which each morphism is an isomorphism. This leads to a playful discussion, out of which pops groupoids, amongst other things. Similar gymnastics lead to preorders reconstituted as categories in which hom-sets have one or fewer elements, and graphs reconstituted as functors from a certain two-object category to . Propositional logic is re-examined from a categorical perspective, though I suspect this is just scratching the surface of the interplay between category theory and logic, and I’d love to read more about this. For me, this was a highlight – that by playing with these definitions and trying things out, all these wonderful familiar structures arise.
The final part of this chapter deals with natural transformations. Just as functors transform categories to categories, natural transformations transform functors to functors. Specifically, a natural transformation between functor and is defined as a morphism (in ) for each object in , such that the diagram below commutes for every morphism in .
I think this section on natural transformations is where the huge number of exercises in the book really came into its own; I doubt very much I would’ve grasped some of the subtleties of natural transformations without them.
We learn about two kinds of composition of natural transformations. The first combines two natural transformations on functors , say one and one to obtain a natural transformation . The other combines a natural transformation between functors with one between functors , to produce one, .
An equivalence relation – weaker than isomorphism – is introduced on categories as a functor such that there exists a functor and natural isomorphisms and . This seemed odd to me — the lack of insistence on isomorphism for equivalence — but the examples all make sense. More importantly it is shown that this requirement is sufficient for the on-homomorphism part of the functor to be bijective for every fixed pair of objects in , which feels right. I think my understanding could benefit from further interesting examples of usefully equivalent but non-isomorphic categories though – suggestions welcome!
Limits and Colimits
Chapter 6 is primarily given over to the definitions and discussion of limits and colimits, generalising the ideas in to other categories. The chapter begins by looking again at product and coproduct. As an example, for a category , with objects , a span is defined as a triple where is an object in and and are morphisms in . A product is then defined as as span on and such that for any other span on and , there exists a unique morphism such that the following diagram commutes. Again, I rather like the author’s slogan for this: “none shall map to and except through me!”
The chapter then further generalises these ideas to the the ideas of limit and colimit. This section was a step up in abstraction and also one of the least “applied” parts of the book. It has certainly piqued my interest – what are limits and colimits in various structures I might use / encounter in my work? – but I found the examples here fairly limited and abstract compared to the copious ones in previous chapters. It’s in this chapter that universal properties begin to rear their head in a major way, and again I’m still not totally intuitively comfortable with this idea – I can crank the handle and answer the exercises, but I don’t feel like I’m yet at home with the idea of universal properties.
Categories at Work
In the final section of the book, we first study adjunctions. Given categories and , an adjunction is defined as a pair of functors and together with a natural isomorphism whose component for , is . Several preceding results and examples can then be re-expressed in this framework, and understood as examples of adjunctions. Adjuctions are also revisited later in the chapter to show a close link with monads.
Categories of functors are discussed, leading up to a discussion of the Yoneda lemma. This is really interesting stuff, as it seems to cross huge boundaries of abstraction. Consider a category . Spivak first defines a functor sending objects in to the set and similarly for morphisms. This is then used in the following theorem:
Let be a category, be an object, and be a set-valued functor. There is a natural bijection .
This looks super weird: the right-hand side is just a plain old set and yet the left hand-side seems to be some tower of abstraction. I worked through a couple of examples I dreamt up, namely (the category with one object) with mapping that object to an arbitrary set, and with and convinced myself this works just fine for those two examples. Unfortunately, the general proof is not covered in the book – readers are referred to Mac Lane – and while some discussion is provided regarding links to databases, I felt like I need a proof to properly grasp what’s going on here under the hood.
Monads are then covered. I’ve briefly seen monads in a very practical context of functional programming, specifically for dealing with I/O in Haskell, but I’ve never looked at them from a theoretical perspective before.
A monad on (monads are only defined in the book on , for reasons not totally apparent to me) is defined as a triple consisting of a functor , a natural transformation , and natural transformation such that the following diagrams commute:
I found this discussion enlightening and relatively straight-forward. Firstly, exception monads are discussed, like Maybe in Haskell, and a nice example of wrapping sets and their functions into power sets and their functions, for which I again enjoyed the exercises. Kleisli categories of monads are defined, and it then becomes apparent why this is such a powerful formalism to think about concepts like program state and I/O.
The Kleisli category of a monad on is defined as a category with objects and morphisms . Identity is in is exactly in and composition of morphisms and in is given by in . You can intuitively see the ‘baggage’ of being carried through in yet hidden from view in .
Finally, the book ends with a (very) brief discussion of operads. I’ve not come across operads before, and from this brief discussion I’m not really sure what they ‘buy’ me beyond monoidal categories.
The exercises are copious and great, especially in the earlier chapters. They kept me engaged throughout and helped me to really understand what I was reading, rather than just think I had. In the later exercises there are many special cases that turn out to be more familiar things. This is both enlightening “oh, an X is just a Y!” and frustrating “yeah, but I want to see Xs that aren’t Ys, otherwise what’s this all heavyweight general machinery for?” I guess this comes from trying to cover a lot of ground in one textbook.
I am struck by just how much mental gymnastics I feel I had to go through to see mathematical structures from the “outside in” as well as from the “inside out”, and how category theory helped me do that. This is a pleasing feeling.
The slogans are great and often amusing, e.g. when describing a functor from the category of Monoids to the category of Sets, “you can use a smartphone as a paperweight.”
The book ends rather abruptly after the brief discussion of operads, with no conclusion. I felt it would be nice to have a summary discussion, and perhaps point out the current directions the field of applied category theory is taking.
Overall, this book has been an enlightening read. It has helped me to see the commonality behind several disparate mathematical structures, and has encouraged me to look for such commonality and relationships in my own future work. Recommended!
The first paper, led by Aaron, is entitled ‘Automatic Generation of Multi-precision Multi-arithmetic CNN Accelerators for FPGAs’, and can be found on arXiv here. This paper is a serious look at getting an automated CNN flow for FPGAs that makes good use of some of the arithmetic flexibility available on these devices. Powers-of-two (“free” multiplication) and fixed-point (“cheap” multiplication) are both leveraged.
The second paper, led by Ameer, looks at the computation of a set of approximate nearest neighbours. This is useful in a number of machine learning settings, both directly as a non-neural deep learning inference algorithm and indirectly within sophisticated deep learning algorithms like Neural Turing Machines. Ameer has shown that this task can be successfully accelerated in an FPGA design, and explores some interesting ways to parameterise the algorithm to make the most of the hardware, leading to tradeoffs between algorithm accuracy and performance.
If you’re at FPT this year, please go and say hello to Aaron, Ameer and Qiang.
This week, I attended my first Asilomar Conference on Circuits, Signals and Computers, a very long-running conference series of the IEEE Signal Processing Society, with a very broad range of topics. I decided to attend Asilomar after being invited to give not just one talk, but two, once by my friend and collaborator Miloš Ercegovac from UCLA, and once by my good colleague Zhiru Zhang from Cornell.
No discussion of highlights of Asilomar can go without pointing out the extraordinarily beautiful setting of a conference centre right on Asilomar Beach. I can certainly see why the conference organisers keep coming back year after year – since the 1970s for Miloš and even earlier for my old friend fred harris, who I met there by surprise.
The conference opened with distinguished lecture by Helmut Bölcskei from ETH Zurich, who gave a wonderful talk about the fundamental limits of deep learning. The key results he presented were about neural networks built of linear computational units and ReLU functions, and he showed how they can approximate a range of different functions. I was already familiar with asymptotic results for infinite depth or infinite width networks, but Bölcskei’s results were different – they showed how the approximation quality can be traded against a metric of neural network complexity that captured the number of bits needed to store the topology and the weights of the network. He was able to show the power of such neural networks across an extremely broad class of functions, and to explain how this comes about.
Compilation for Spatial Computing Architectures
This session was organised by Zhiru Zhang from Cornell and Hongbo Rong from Intel. The first talk, given by Yi-Hsiang Lai from Cornell, described the HeteroCL infrastructure, about which I’ve previously blogged in my description of FPGA 2019. Very closely related to this was Hongbo’s own work at Intel Labs, which makes heavy use of polyhedral methods, and work from the systolic array community on affine and uniform recurrence equations.
Miloš Ercegovac from UCLA and James Stine from Oklahoma State University looked at how digit iteration techniques for division compare to multiplication-based techniques. Alexander Groszewski and Earl Swartzlander from UT Austin discussed their results from deterministic unary arithmetic inspired by stochastic computing; Keshab Parhi from the audience raised the interesting point of the importance of preservation of temporal structure in specially designed deterministic sequences for purposes of compositionality.
I really enjoyed the unusual talk by Keshab Parhi (U. Minnesota) on Molecular Computing Inspired by Stochastic Logic (see here for more details) via Fractional Coding, building onSoloveichik, Seelig and Winfree. If digits are encoded as relative concentrations of molecules, the problem of signal correlation, which tends to take the shine off stochastic computing work, can be avoided. He proposed computation using molecular reaction rates, and showed how to encode values as concentrations of two different molecules; his techniques have been verified in simulation – I would love to see this in a test-tube.
There was a very enjoyable talk by Alessandro Achille from UCLA on studying deep neural networks from an information-theoretic perspective. He pointed out that real-valued weights appear to contain infinite information, but that by using the principle that small perturbations in weights should not throw-off the classification result completely, we can recover a finite weight encoding. He then moved on to show using a PAC-Bayes bound that good generalisation comes from low weight information. He demonstrated that Stochastic Gradient Descent implicitly minimises Fisher information, but that for generalisation performance, it is Shannon information that should be bounded – he then derived a connection between the two under some conditions.
Tom Goldstein (University of Maryland) gave a stunningly illustrated talk on Understanding Generalization in Neural Nets via Visualization, based on his co-authored paper on the topic. He sought to empirically understand how the continuous piecewise linear functions of modern DNNs, when combined with SGD-based optimisation, lead to functions that generalise well. This was done via a clever process of “poisoning” training data to obtain badly generalising minima.
This session was organised by Keshab Parhi (University of Minnesota.)
Danny Bankman gave a talk about Stanford’s RRAM-based DNNs. He showed that register-file access accounts for the majority of energy in standard CMOS processor-like architectures, and drew the conclusion that architectures should be “memory-like” in their design, using “conductance-mode arithmetic” with very low precision integer activations, and put the necessary ramp generator for their ADC right inside the RRAM array. Results are verified using SPICE. I know little about RRAM technology, but talking with my colleagues Themis Prodromakis and Tony Kenyon has got me intrigued.
Deep Learning Theory
This session was organised by Tom Goldstein (University of Maryland.)
My favourite talk in this session was by Tom himself, in which he presented an analysis of adversarial attacks in DNNs, again beautifully illustrated – based on his co-authored paper. He showed that due to the high dimensionality of the spaces involved, you are extremely likely to hit – at random – a point in the input space that can be adversarially perturbed. He demonstrated – using the audience as guinea pigs – that adversarial perturbation can also trick humans quite easily on the CIFAR-10 data set. Perhaps my favourite twist on the talk was that he gave the talk wearing an “invisibility cloak” which – if worn – tricks YOLO into not identifying the wearer.
Reflections on Asilomar
I’ve sent PhD students to Asilomar before, but this was the first time I attended myself. It’s a very broad conference, in a beautiful setting. It seems to be a great venue to complement the more technically homogeneous conferences like FPGA which I help to organise – they serve different purposes. Asilomar is a great conference to have your work seen by people who wouldn’t usually follow your work, and to pick up ideas from neighbouring fields.
I’ve been interested in approximation, and how it can be used to save resources, ever since my PhD 20 years ago, where I coined the term “lossy synthesis” to mean the synthesis of a circuit / program where error can be judiciously introduced in order to effect an improvement in performance or silicon area. Recently, this area of research has become known as “approximate computing“, and a bewildering number of ways of approximating behaviour – at the circuit and software level – have been introduced.
Some of the existing approaches for approximate circuit synthesis are point solutions for particular IP cores (e.g. our approximate multiplier work) or involve moving beyond standard digital design methodologies (e.g. our overclocking work.) However, a few pieces of work develop a systematic method for arbitrary circuits, and Ilaria’s work falls into this category.
Essentially, she studies that class of approximation that can be induced solely by removing chunks of a logic circuit, replacing dangling nets with constant values – a technique my co-authors referred to as Circuit Carving in their DATE 2018 paper.
Our DAC paper presents a methodology for bounding the error that can be induced by performing such an operation. Such error can be bounded by exhaustive simulation or SAT, but not for large circuits with many inputs due to scalability concerns. On the other hand, coarse bounds for the error can be derived very quickly. Ilaria’s work neatly explores the space between these two extremes, allowing analysis execution time to be traded for bound quality in a natural way.
Approximation’s time has definitely come, with acceptance in the current era often driven by machine-learning applications, as I explore in a previous blog post. Ilaria’s paper is an interesting and general approach to the circuit-level problem.
In this paper, we take a very unusual approach to the design of a deep neural network accelerator in hardware: for us, the nodes in the neural network are Boolean lookup tables.
We were motivated initially by the fact that in very low precision FPGA neural network architectures, lookup tables are often used for arithmetic, but they are often used for very specific functions: while a -LUT is capable of implementing any nonlinear Boolean function with inputs, it ends up getting used for only a tiny fraction of these functions. A good example is binarised neural networks (BNNs) such as FINN, where LUTs end up being used to implement XNOR gates (multiplication over ) and popcount functions. Our research question is therefore: rather than restricting ourselves to these functions, can we make better use of the LUTs by embracing the nonlinearity and the -input support they give us?
We show that this is indeed possible. Our basic approach is to start with a weight-binarised neural network, add inputs to each node to bring them up to support, and then retrain the Boolean function implemented by that node. Retraining Boolean functions is a bit tricky, of course, because neural network training algorithms are not designed for this purpose. We generate a smooth interpolating function over the LUT entries, allowing us to use standard neural network training software (we use TensorFlow).
The end result is that the re-trained neural network is far more prunable than the original, because the extra inputs to the -LUTs compensate for the removal of other nodes. Thus we end up with a much sparser neural network for the same classification accuracy. The sparsity improves our area by a factor of two or more, yet the more complex inference functions at each node are effectively provided “for free” by the FPGA architecture.
I very much enjoyed the format of the meeting: select interesting speakers and allow them 30 minutes to talk about a topic of their choosing related to the theme of the meeting, with 10 full minutes for discussion and in-depth questions after each talk. Posters were also presented by a wide variety of researchers, with each poster presenter given a one-minute lightning-talk slot. Two of my PhD students, Erwei Wang and He Li, took this opportunity. Erwei presented a preview of our LUTNet paper appearing at FCCM very soon (separate blog post to follow), while He presented some of our work on arbitrary precision iterative compute.
Talks by others included:
David Keyes (KAUST) on the topic “Hierarchical algorithms on hierarchical architectures”. He discussed some very interesting hierarchical low-rank decompositions and also hierarchies of numerical precision.
Kathy Yelick (Berkeley) spoke on “Antisocial parallelism: avoiding, hiding and managing communication”, a very fruitful area of research in recent years. A few years ago, Abid Rafique, one of my former PhD students (joint with Nachiket Kapre) made use of this work, and it was good to catch up with the current state of research.
Anna Scaife (Manchester) gave a fascinating insight into the Square Kilometre Array. The sheer volumes of data are mind boggling (zettabytes annually) and pose unique algorithmic challenges.
Michela Taufer (UTK) discussed molecular dynamics workflows, and how we may be able to harness machine learning to reduce the human bottlenecks in such workflows in the future.
Rick Stevens (Argonne) gave a very engaging talk about the intersection of machine learning with computational science, exemplified by the Candle project, using deep learning in cancer research. He mentioned many of the emerging architectures for deep learning and their optimisation for low-precision compute. Interested readers may also like our recent survey article on the topic.
John Shalf (LBNL) spoke about alternative computational models beyond CMOS, new materials for switches, and the growth of hardware specialisation. He proposed the strategy of: hardware-driven algorithm design, algorithm-driven hardware design, and co-design of hardware and algorithm. Having worked in the FPGA community for decades where this has been our mantra, it is great to see hardware specialisation spreading the message of co-design into the HPC community.
Doug Kothe (ORNL) provided a very interesting insight into exascale computational motifs at the DOE.
Tony Hey (STFC) set out a compelling argument that academics in applied deep learning should focus on deep learning for scientific data, on the basis that (i) scientific data sets are huge and open and (ii) head-to-head competition with industrial giants on profit-oriented deep-learning applications, without access to their data sets, is a poor choice for academia. I think the same argument could be made for academic computer architecture too. His team are developing benchmarks for scientific machine learning, complementary to MLPerf.
Erin Carson (Charles University) presented an enchanting talk on iterative linear algebra in low and multiple precision compute. I’m a fan of her earlier work with Nick, and it was great to hear her current thinking and discussion of least-squares iterative refinement.
Steve Furber (Manchester) spoke about arithmetic in the context of the SpiNNaker machine, and a particular approach they have taken to numerical solution of neural ODEs in fixed-point arithmetic, demonstrating that stochastic rounding can radically improve the quality of their results.
Tim Palmer (Oxford) argued for low precision compute in weather and climate models, allowing the recouped computational cost to be recycled into better resolution weather models, resulting in higher overall accuracy. This reminded me of the argument I made with my PhD student Antonio Roldão Lopes and collaborator Eric Kerrigan in our paper More FLOPS or More Precision?
Guillaume Aupy (INRIA) discussed memory-efficient approaches for automatic differentiation and back-propagation.
Satoshi Matsuoka (RIKEN Centre) took us through the work being done on Post-K, a new Japanese supercomputer being designed to provide compute infrastructure for future workloads at the intersection of big data and AI/ML.
Mike Heroux (Sandia) spoke about his work developing programming infrastructure for future HPC, in particular for performance portability and for system reliability.
My own talk was entitled “Rethinking Deep Learning: Architectures and Algorithms” – I will save summarising the content for a future blog post. Slides for all these talks will appear on the Manchester Numerical Linear Algebra group website. In addition, each speaker has received an invitation to author an article for a special issue of Philosophical Transactions A – this should be a very interesting read.
I was impressed by the great attendance at the meeting and by the quality of the technical interaction; I met several new and interesting people at the intersection of numerical analysis and scientific computing.
Readers can find a summary of some of the talks I found most interesting below.
On Tuesday, my colleague Martin Trefzer chaired a session on Computational and Resource-efficiency in Quantum and Approximate Computing. The work by Sekanina was interesting, using information on data distribution to drive the construction of approximate circuits. Circuits are constructed from the baseline using a technique called Cartesian Genetic Programming. I have recently been collaborating with Ilaria Scarabottolo and others from Laura Pozzi‘s group on a related problem – see our DAC 2019 paper (to appear) for details – so this was of particular interest.
On Wednesday, I chaired a session When Approximation Meets Dependability, together with my colleague Rishad Shafik. Ioannis Tsiokanos from Queen’s University Belfast presented an interesting approach that dynamically truncates precision in order to avoid timing violations. This is interestingly complementary to the approach I developed with my former PhD student Kan Shi where we first simply allowed timing violations [FCCM 2013], and secondly redesigned data representation based on Ercegovac‘s online arithmetic, so that timing violations caused low-magnitude error [DAC 2014].
On Thursday morning, I attended the session Architectures for Emerging Machine Learning Techniques. Interestingly, there was a paper there making use of Gustafson’s posits within hardware-accelerated deep learning, a technique they dub Deep Positron.
The highlight talk for me was Ed Lee‘s Thursday keynote, A Fundamental Look at Models and Intelligence. Although I’ve been aware of Lee’s work, especially on Ptolemy, since I did my own PhD, I don’t think I’ve ever had the pleasure of hearing him lecture before. It was insightful and entertaining. A central theme of the talk was that models mean two different things for scientists and engineers: a scientist builds a model to correspond closely to a ‘thing’; an engineer builds a ‘thing’ to correspond closely to a model. He used dichotomy to illuminate some of the differences we see between neuroscience-inspired artificial intelligence and the kind of AI we see as very popular at the moment, such as deep learning. Lee’s general-readership book Plato and the Nerd – which has been on my “to read” list since my colleague and friend Steve Neuendorffer mentioned it to me a few years ago – has just climbed several notches up that list!
On Thursday afternoon, I attended the session on The Art of Synthesizing Logic. My favourite talk in this session was from Heinz Reiner, who presented a collaboration between EPFL and UC Berkeley on Boolean rewriting for logic synthesis, in which exact synthesis methods are used to replace circuit cuts, rather than resorting to a pre-computed database of optimal function implementations. During the talk, Reiner also pointed the audience to an impressive-looking GitHub repo, featuring what looks like some very useful tools.
Friday is always workshop day at DATE, featuring a number of satellite workshops. I attended the workshop entitled Quo Vadis, Logic Synthesis?, organised by Tiziano Villa and Luca Carloni. This was a one-off workshop in celebration of the 35th anniversary of the publication of the influential Espresso book on two-level logic minimisation.
My favourite talk in this workshop was from Jordi Cortadella, who spoke about a method for synthesising Boolean relations. This is the problem of synthesising a the cheapest implementation of a function , which one is free to choose from amongst the given relation , i.e. viewing as a relation , one is free to chose – and its implementation – subject to the requirement that for some given relation (not necessarily a function). This is a strict generalisation of the well-known problem of Boolean ‘don’t care’ conditions, a.k.a. incompletely specified functions. Cortadella presented a method leveraging the known approaches to this latter problem, by exploring the semi-lattice of relations between these sets generated by in a structured way, using a form of branch-and-bound.
Soeken also presented a very interesting summary of three uses of SAT within logic synthesis, namely Schmitt’s ASP-DAC paper on SAT-based LUT mapping, Eén’s paper on using logic synthesis for efficient SAT (rather than SAT for efficient logic synthesis) and Haaswijk‘s recent PhD on making exact logic synthesis more scalable by providing partial topological information – a topic that interestingly has some echoes in work I’m soon to present at the Royal Society.
The workshop was an enjoyable way to end DATE, and I was disappointed to have to leave half-way through – there may well have been other interesting talks presented in the afternoon.
This was my first SIAM conference. I was kindly invited to speak on the topic of floating-point error analysis by Pierre Blanchard, Nick Higham and Theo Mary. I very much enjoyed the sessions they organised and indeed the CSE conference, which I hope to be able to attend more regularly from now on.
My own talk was entitled Approximate Arithmetic – A Hardware perspective. I spoke about the rise of architecture specialisation as driving the need for closer collaboration between computer architects and numerical analysts, about some of our work on automatic error bounds Boland and Constantinides (2011) and Magron, Constantinides and Donaldson (2017), on code refactoring Gao and Constantinides (2015), as well as some of our most recent work on machine learning (I will blog separately about this latter topic over the next couple of months.)
The CSE conference is very large – with 30-40 small parallel sessions happening at any given moment – so I cannot begin to summarise the conference. However, I include some notes below on other talks I found particularly interesting.
I very much enjoyed the plenary presentation by Rachel Ward on Stochastic Gradient Descent (SGD) in Theory and Practice. She introduced the SGD method very nicely, and looked at various assumptions for convergence. She took a particularly illuminating approach, by looking at applying SGD to the simple special case of solving a system of linear equations by minimising in the case where . She showed that if the system is under-determined, then SGD converges to the solution of minimum 2-norm, and therefore has an inherent regularising effect. I was surprised by some of the results on overparameterised neural networks, showing that SGD finds global minimisers and that there really doesn’t tend to be much overfitting despite the huge number of parameters, pointing to the implicit regularisation caused by the SGD algorithm itself. I learnt a lot from this talk, and have several papers on my “to read” list as a result, in particular:
There was also an interesting plenary from Anima Anandkumar on the role of tensors in machine learning. The mathematical structure of tensors and multi-linear algebra are topics I’ve not explored before – mainly because I’ve not seen the need to spend time on them. Anandkumar certainly provided me with motivation to do that!
Floating-Point Error Analysis
Theo Mary from the University of Manchester gave a very good presentation of his work with Nick Higham on probabilistic rounding error analysis, treating numerical roundoff errors as zero-mean independent random variables of arbitrary distribution, making use of Hoeffding’s inequality to a produce a backward error analysis. Their work is described in more detail on their own blog post and – in more depth – in their their very interesting paper. It’s a really exciting and useful direction, I think, given the greater emphasis on average-case performance from modern applications, together with both very large data sets and very low precision computation, the combination of which renders many worst-case analyses meaningless. In a similar vein, Ilse Ipsen also presented a very interesting approach: a forward error analysis, more specialised in that she only looked at inner products, but also without the assumption of independence, making use of Azuma’s inequality. The paper on this topic has not yet been finished, but I certainly look forward to reading it in due course!
Reducing Communication Costs
There were a number of interesting talks on mitigating communication costs. Lawrence Livermore National Labs presented several papers relating to the ZFP format they’ve recently proposed for (lossily) compressed floating-point vectors, at a mini-symposium organised by Alyson Fox, Jeffrey Hittinger, and James Diffenderfer. Diffenderfer’s talk developed a bound on the norm-wise relative error of vectors reconstructed from ZFP; Alyson Fox’s talk then extended this to the setting of iterative methods, noting as future work their interest in probabilistic analyses. In the same session, Nick Higham gave a crystal clear and well-motivated talk on his recent work with Srikara Pranesh and Mawussi Zunon – slides and paper are available. This work extends the applicability of Nick’s earlier work with Erin Carson to cases that would have over- or under-flowed, or led to subnormal numbers, without the scaling technique developed and analysed here. They use matrix equilibration – this reminded me of some work I did with my former PhD student Juan Jerez and colleague Eric Kerrigan, but in our case for a different algorithm kernel and targeting fixed-point arithmetic, where making use of the full dynamic range is particularly important. The Higham, Pranesh and Zunon results are both interesting and practically very useful.
In a different session, Hartwig Anzt spoke about the work he and others have been doing to explicitly decouple storage precision from compute precision in sparse linear algebra. The idea is simple but effective: take the high-order bits of the mantissa (and the sign / exponent) and store them in one chunk of data and – separately – store the low-order bits in another chunk. Perform all arithmetic in high precision (because it’s not the computation that’s the bottleneck), but convert low-precision stored data to high precision on the fly at data load (e.g. by packing low-order bits with zeros.) Then, at run-time, decide whether to load the full-precision data or only the low-precision data, based on current estimates of convergence. This approach could also make a good case study application for the run-time adaptation methodology we developed with U. Southampton in the PRiME project.
Beyond the technical talks, there were two things that stood out for me since I’m new to the conference. Firstly, there were many more women than in the typical engineering conferences I attend. I don’t know whether the statistics on maths versus engineering are in line with this observation, but clearly maths is doing something right from which we could learn. Secondly, there were clear sessions devoted to community building: mentoring sessions, tutorials for new research students, SIAM student chapter presentations, early career panels, presentations on funding programmes, diversity and inclusion sessions, a session on helping people improve their CV, an explicit careers fair, etc. Partly this may simply reflect the size of the conference, but even so, this seems to be something SIAM does particularly well.
This week, I attended the ACM FPGA 2019 conference in Seaside (nr. Monterey), California, the annual premier ACM event on FPGAs and associated technology. I’ve been involved in this conference for many years, as author, TPC member, TPC and general chair, and now steering committee member. Fashions have come and gone over this time, including in the applications of FPGA technology, but the programme at FPGA is always interesting and high quality. This year particular thanks should go to Steve Neuendorffer for organising the conference programme and to Kia Bazargan in his role as General Chair.
Below, I summarise my personal highlights of the conference. These are by no means my view of the “best” papers – they are all good – but rather those that interested me the most.
Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity, a collaboration between Tsinghua, Beihang, Harbin Institute of Technology, and Microsoft Research, tackled the problem of ensuring that an inference implementation, when sparsified, gets sparsified in a way that leads to balanced load across the various memory banks. The idea is simple but effective, and leads to an interesting tradeoff between the quality of LSTM output and performance. I think it would be interesting to try to design a training method / regulariser that encourages this kind of structured sparsity in the first place.
Kees Vissers from Xilinx presented a keynote talk summarising their new Versal architecture, which the Imperial team had previously had the pleasure of hearing about from our alumnus Sam Bayliss. This is a really very different architecture to standard FPGA fare, and readers might well be interested in taking a look at Kees’s slides to learn more.
Vaughn Betz presented a paper from the University of Toronto, Math Doesn’t Have to be Hard: Logic Block Architectures to Enhance Low Precision Multiply-Accumulate on FPGAs. This work proposed a number of relatively minor tweaks to Intel FPGA architectures which might have a signifiant impact on low-precision MAC performance. Vaughn began by pointing out that in this application, very general LUTs often get wasted by being used as very simple gates – he gave the example of AND gates in partial product generation, and even as buffers. A number of architectural proposals were made to avoid this issue. I find this particularly interesting at the moment, because together with my PhD student Erwei Wang and others, I have proposed a new neural network architecture called LUTNet, motivated by exactly the same concern. However, our approach is the dual of that presented by Vaughn – we keep the FPGA architecture constant but modify the basic computations performed by the neural network to be more well-tuned to the underlying architecture. Expect a future blog post on our approach!
Lana Josipović presented the most recent work on the dynamically scheduled HLS tool from Paolo Ienne‘s group at EPFL, which they first presented at last year’s conference – see my blog post from last year. This time they have added speculative execution to their armoury. This is a very interesting line of work as HLS moves to encompass more and more complex algorithns, and Lana did a great job illustrating how it works.
The panel discussion – chaired by Deming Chen – was on the topic of whether FPGAs have a role to play in Supercomputing. As I pointed out in the discussion, to answer this question scientifically we need to have a working definition of “FPGA” and of “Supercomputing” – both seem to be on shifting sands at the moment, and we need to resist reducing a question like this to “does LINPACK run well on a Virtex or Stratix device.”
We also had the pleasure of congratulating Deming Chen and Paul Chow on their recently awarded fellowships, awarding a best paper prize, recognising several historical FPGA papers of significance, and last but by no means least welcoming the new baby of two of the stalwarts of the FPGA community – baby complete with “I am into FPGA” T-shirt! All this led to an excellent community feeling, which we should continue to nurture.