Boolean Circuits are Neural Networks

On Monday, my PhD student Erwei Wang will present our work (joint also with James Davis and Peter Cheung) called LUTNet: Rethinking Inference in FPGA Soft Logic at the IEEE International Symposium on Field-Programmable Custom Computing Machines in San Diego, California.

In this paper, we take a very unusual approach to the design of a deep neural network accelerator in hardware: for us, the nodes in the neural network are Boolean lookup tables.

We were motivated initially by the fact that in very low precision FPGA neural network architectures, lookup tables are often used for arithmetic, but they are often used for very specific functions: while a K-LUT is capable of implementing any nonlinear Boolean function with K inputs, it ends up getting used for only a tiny fraction of these 2^{2^K} functions. A good example is binarised neural networks (BNNs) such as FINN, where LUTs end up being used to implement XNOR gates (multiplication over \{-1,+1\}) and popcount functions. Our research question is therefore: rather than restricting ourselves to these functions, can we make better use of the LUTs by embracing the nonlinearity and the K-input support they give us?

We show that this is indeed possible. Our basic approach is to start with a weight-binarised neural network, add inputs to each node to bring them up to K support, and then retrain the Boolean function implemented by that node. Retraining Boolean functions is a bit tricky, of course, because neural network training algorithms are not designed for this purpose. We generate a smooth interpolating function over the LUT entries, allowing us to use standard neural network training software (we use TensorFlow).

The end result is that the re-trained neural network is far more prunable than the original, because the extra inputs to the K-LUTs compensate for the removal of other nodes. Thus we end up with a much sparser neural network for the same classification accuracy. The sparsity improves our area by a factor of two or more, yet the more complex inference functions at each node are effectively provided “for free” by the FPGA architecture.

Circuit netlist? Neural network? Same thing!

Royal Society Discussion Meeting

I was kindly invited to speak at a Royal Society Discussion Meeting before Easter, entitled “Numerical Algorithms for High-Performance Computational Science”, organised by Nick Higham, Laura Grigori and Jack Dongarra. This blog post summarises some of the discussion at the meeting.

Prof Nick Higham kicking off the proceedings

I very much enjoyed the format of the meeting: select interesting speakers and allow them 30 minutes to talk about a topic of their choosing related to the theme of the meeting, with 10 full minutes for discussion and in-depth questions after each talk. Posters were also presented by a wide variety of researchers, with each poster presenter given a one-minute lightning-talk slot. Two of my PhD students, Erwei Wang and He Li, took this opportunity. Erwei presented a preview of our LUTNet paper appearing at FCCM very soon (separate blog post to follow), while He presented some of our work on arbitrary precision iterative compute.


Talks by others included:

  • David Keyes (KAUST) on the topic “Hierarchical algorithms on hierarchical architectures”. He discussed some very interesting hierarchical low-rank decompositions and also hierarchies of numerical precision.
  • Kathy Yelick (Berkeley) spoke on “Antisocial parallelism: avoiding, hiding and managing communication”, a very fruitful area of research in recent years. A few years ago, Abid Rafique, one of my former PhD students (joint with Nachiket Kapre) made use of this work, and it was good to catch up with the current state of research.
  • Anna Scaife (Manchester) gave a fascinating insight into the Square Kilometre Array. The sheer volumes of data are mind boggling (zettabytes annually) and pose unique algorithmic challenges.
  • Michela Taufer (UTK) discussed molecular dynamics workflows, and how we may be able to harness machine learning to reduce the human bottlenecks in such workflows in the future.
  • Rick Stevens (Argonne) gave a very engaging talk about the intersection of machine learning with computational science, exemplified by the Candle project, using deep learning in cancer research. He mentioned many of the emerging architectures for deep learning and their optimisation for low-precision compute. Interested readers may also like our recent survey article on the topic.
  • Jack Poulson (Hodge Star) spoke about sampling Determinantal Point Processes, and how links to matrix decomposition algorithms can be used to radically accelerate this process.
  • John Shalf (LBNL) spoke about alternative computational models beyond CMOS, new materials for switches, and the growth of hardware specialisation. He proposed the strategy of: hardware-driven algorithm design, algorithm-driven hardware design, and co-design of hardware and algorithm. Having worked in the FPGA community for decades where this has been our mantra, it is great to see hardware specialisation spreading the message of co-design into the HPC community.
  • Doug Kothe (ORNL) provided a very interesting insight into exascale computational motifs at the DOE.
  • Tony Hey (STFC) set out a compelling argument that academics in applied deep learning should focus on deep learning for scientific data, on the basis that (i) scientific data sets are huge and open and (ii) head-to-head competition with industrial giants on profit-oriented deep-learning applications, without access to their data sets, is a poor choice for academia. I think the same argument could be made for academic computer architecture too. His team are developing benchmarks for scientific machine learning, complementary to MLPerf.
  • Erin Carson (Charles University) presented an enchanting talk on iterative linear algebra in low and multiple precision compute. I’m a fan of her earlier work with Nick, and it was great to hear her current thinking and discussion of least-squares iterative refinement.
  • Steve Furber (Manchester) spoke about arithmetic in the context of the SpiNNaker machine, and a particular approach they have taken to numerical solution of neural ODEs in fixed-point arithmetic, demonstrating that stochastic rounding can radically improve the quality of their results.
  • Tim Palmer (Oxford) argued for low precision compute in weather and climate models, allowing the recouped computational cost to be recycled into better resolution weather models, resulting in higher overall accuracy. This reminded me of the argument I made with my PhD student Antonio Roldão Lopes and collaborator Eric Kerrigan in our paper More FLOPS or More Precision?
  • Guillaume Aupy (INRIA) discussed memory-efficient approaches for automatic differentiation and back-propagation.
  • Satoshi Matsuoka (RIKEN Centre) took us through the work being done on Post-K, a new Japanese supercomputer being designed to provide compute infrastructure for future workloads at the intersection of big data and AI/ML.
  • Mike Heroux (Sandia) spoke about his work developing programming infrastructure for future HPC, in particular for performance portability and for system reliability.

My own talk was entitled “Rethinking Deep Learning: Architectures and Algorithms” – I will save summarising the content for a future blog post. Slides for all these talks will appear on the Manchester Numerical Linear Algebra group website. In addition, each speaker has received an invitation to author an article for a special issue of Philosophical Transactions A – this should be a very interesting read.

D3tHpjbX4AAWM53.jpg large
My talk on “Rethinking Deep Learning: Architectures and Algorithms”

I was impressed by the great attendance at the meeting and by the quality of the technical interaction; I met several new and interesting people at the intersection of numerical analysis and scientific computing.

Special thanks to the organisers, Nick, Laura, and Jack for putting an excellent programme together. And congratulations to Jack for the news – a few days after the meeting – of his election to Foreign Member of the Royal Society!

The Imperial EEE Team at the Royal Society

DATE 2019: Some Highlights

This week, I attended the Design, Automation & Test in Europe (DATE) conference in Florence, Italy. DATE is a large conference, which I have attended irregularly since I was a PhD student. This year, the general chair was a long-standing colleague from the FPGA community, Jürgen Teich.

Readers can find a summary of some of the talks I found most interesting below.

On Tuesday, my colleague Martin Trefzer chaired a session on Computational and Resource-efficiency in Quantum and Approximate Computing. The work by Sekanina was  interesting, using information on data distribution to drive the construction of approximate circuits. Circuits are constructed from the baseline using a technique called Cartesian Genetic Programming. I have recently been collaborating with Ilaria Scarabottolo and others from Laura Pozzi‘s group on a related problem – see our DAC 2019 paper (to appear) for details – so this was of particular interest.

On Wednesday, I chaired a session When Approximation Meets Dependability, together with my colleague Rishad Shafik. Ioannis Tsiokanos from Queen’s University Belfast presented an interesting approach that dynamically truncates precision in order to avoid timing violations. This is interestingly complementary to the approach I developed with my former PhD student Kan Shi where we first simply allowed timing violations [FCCM 2013], and secondly redesigned data representation based on Ercegovac‘s online arithmetic, so that timing violations caused low-magnitude error [DAC 2014].

David Pellerin from Amazon gave an interesting keynote address which very heavily emphasised Amazon’s F1 FPGA offering, which was – of course – music to my ears.

On Thursday morning, I attended the session Architectures for Emerging Machine Learning Techniques. Interestingly, there was a paper there making use of Gustafson’s posits within hardware-accelerated deep learning, a technique they dub Deep Positron.

The highlight talk for me was Ed Lee‘s Thursday keynote, A Fundamental Look at Models and Intelligence. Although I’ve been aware of Lee’s work, especially on Ptolemy, since I did my own PhD, I don’t think I’ve ever had the pleasure of hearing him lecture before. It was insightful and entertaining. A central theme of the talk was that models mean two different things for scientists and engineers: a scientist builds a model to correspond closely to a ‘thing’; an engineer builds a ‘thing’ to correspond closely to a model. He used dichotomy to illuminate some of the differences we see between neuroscience-inspired artificial intelligence and the kind of AI we see as very popular at the moment, such as deep learning. Lee’s general-readership book Plato and the Nerd – which has been on my “to read” list since my colleague and friend Steve Neuendorffer mentioned it to me a few years ago – has just climbed several notches up that list!

On Thursday afternoon, I attended the session on The Art of Synthesizing Logic. My favourite talk in this session was from Heinz Reiner, who presented a collaboration between EPFL and UC Berkeley on Boolean rewriting for logic synthesis, in which exact synthesis methods are used to replace circuit cuts, rather than resorting to a pre-computed database of optimal function implementations. During the talk, Reiner also pointed the audience to an impressive-looking GitHub repo, featuring what looks like some very useful tools.

Friday is always workshop day at DATE, featuring a number of satellite workshops.  I attended the workshop entitled Quo Vadis, Logic Synthesis?, organised by Tiziano Villa and Luca Carloni. This was a one-off workshop in celebration of the 35th anniversary of the publication of the influential Espresso book on two-level logic minimisation.

Villa talked the audience through the history of logic synthesis, starting with the Quine-McCluskey method.

My favourite talk in this workshop was from Jordi Cortadella, who spoke about a method for synthesising Boolean relations. This is the problem of synthesising a the cheapest implementation of a function f : {\mathbb B}^n \to {\mathbb B}^m, which one is free to choose from amongst the given relation R, i.e. viewing f as a relation f \subset {\mathbb B}^n \times {\mathbb B}^m, one is free to chose f – and its implementation – subject to the requirement that f \subseteq R for some given relation R (not necessarily a function). This is a strict generalisation of the well-known problem of Boolean ‘don’t care’ conditions, a.k.a. incompletely specified functions. Cortadella presented a method leveraging the known approaches to this latter problem, by exploring the semi-lattice of relations between these sets generated by \subseteq in a structured way, using a form of branch-and-bound.

Soeken also presented a very interesting summary of three uses of SAT within logic synthesis, namely Schmitt’s ASP-DAC paper on SAT-based LUT mapping, Eén’s paper on using logic synthesis for efficient SAT (rather than SAT for efficient logic synthesis) and Haaswijk‘s recent PhD on making exact logic synthesis more scalable by providing partial topological information – a topic that interestingly has some echoes in work I’m soon to present at the Royal Society.

The workshop was an enjoyable way to end DATE, and I was disappointed to have to leave half-way through – there may well have been other interesting talks presented in the afternoon.

Highlights of CSE 2019

Over the second half of this week, I’ve been attending the SIAM Computational Science and Engineering conference in Spokane, Washington – a short flight north (and a radical change in weather) from my earlier conference in California this week.

Spokane, WA in February. Temperatures were as low as -12℃.

This was my first SIAM conference. I was kindly invited to speak on the topic of floating-point error analysis by Pierre Blanchard, Nick Higham and Theo Mary. I very much enjoyed the sessions they organised and indeed the CSE conference, which I hope to be able to attend more regularly from now on.

My own talk was entitled Approximate Arithmetic – A Hardware perspective. I spoke about the rise of architecture specialisation as driving the need for closer collaboration between computer architects and numerical analysts, about some of our work on automatic error bounds Boland and Constantinides (2011) and Magron, Constantinides and Donaldson (2017), on code refactoring Gao and Constantinides (2015), as well as some of our most recent work on machine learning (I will blog separately about this latter topic over the next couple of months.)

The CSE conference is very large – with 30-40 small parallel sessions happening at any given moment – so I cannot begin to summarise the conference. However, I include some notes below on other talks I found particularly interesting.

Plenary Sessions

I very much enjoyed the plenary presentation by Rachel Ward on Stochastic Gradient Descent (SGD) in Theory and Practice. She introduced the SGD method very nicely, and looked at various assumptions for convergence. She took a particularly illuminating approach, by looking at applying SGD to the simple special case of solving a system of linear equations by minimising F(w) = \frac{1}{2}||Aw-b||^2 in the case where \exists w^*. Aw^* = b. She showed that if the system is under-determined, then SGD converges to the solution of minimum 2-norm, and therefore has an inherent regularising effect. I was surprised by some of the results on overparameterised neural networks, showing that SGD finds global minimisers and that there really doesn’t tend to be much overfitting despite the huge number of parameters, pointing to the implicit regularisation caused by the SGD algorithm itself. I learnt a lot from this talk, and have several papers on my “to read” list as a result, in particular:

There was also an interesting plenary from Anima Anandkumar on the role of tensors in machine learning. The mathematical structure of tensors and multi-linear algebra are topics I’ve not explored before – mainly because I’ve not seen the need to spend time on them. Anandkumar certainly provided me with motivation to do that!

Floating-Point Error Analysis

Theo Mary from the University of Manchester gave a very good presentation of his work with Nick Higham on probabilistic rounding error analysis, treating numerical roundoff errors as zero-mean independent random variables of arbitrary distribution, making use of Hoeffding’s inequality to a produce a backward error analysis. Their work is described in more detail on their own blog post and – in more depth – in their their very interesting paper. It’s a really exciting and useful direction, I think, given the greater emphasis on average-case performance from modern applications, together with both very large data sets and very low precision computation, the combination of which renders many worst-case analyses meaningless. In a similar vein, Ilse Ipsen also presented a very interesting approach: a forward error analysis, more specialised in that she only looked at inner products, but also without the assumption of independence, making use of Azuma’s inequality. The paper on this topic has not yet been finished, but I certainly look forward to reading it in due course!

Reducing Communication Costs

There were a number of interesting talks on mitigating communication costs. Lawrence Livermore National Labs presented several papers relating to the ZFP format they’ve recently proposed for (lossily) compressed floating-point vectors, at a mini-symposium organised by Alyson Fox, Jeffrey Hittinger, and James Diffenderfer. Diffenderfer’s talk developed a bound on the norm-wise relative error of vectors reconstructed from ZFP; Alyson Fox’s talk then extended this to the setting of iterative methods, noting as future work their interest in probabilistic analyses. In the same session, Nick Higham gave a crystal clear and well-motivated talk on his recent work with Srikara Pranesh and Mawussi Zunonslides and paper are available. This work extends the applicability of Nick’s earlier work with Erin Carson to cases that would have over- or under-flowed, or led to subnormal numbers, without the scaling technique developed and analysed here. They use matrix equilibration – this reminded me of some work I did with my former PhD student Juan Jerez and colleague Eric Kerrigan, but in our case for a different algorithm kernel and targeting fixed-point arithmetic, where making use of the full dynamic range is particularly important. The Higham, Pranesh and Zunon results are both interesting and practically very useful.

In a different session, Hartwig Anzt spoke about the work he and others have been doing to explicitly decouple storage precision from compute precision in sparse linear algebra. The idea is simple but effective: take the high-order bits of the mantissa (and the sign / exponent) and store them in one chunk of data and – separately – store the low-order bits in another chunk. Perform all arithmetic in high precision (because it’s not the computation that’s the bottleneck), but convert low-precision stored data to high precision on the fly at data load (e.g. by packing low-order bits with zeros.) Then, at run-time, decide whether to load the full-precision data or only the low-precision data, based on current estimates of convergence. This approach could also make a good case study application for the run-time adaptation methodology we developed with U. Southampton in the PRiME project.

A Reflection

Beyond the technical talks, there were two things that stood out for me since I’m new to the conference. Firstly, there were many more women than in the typical engineering conferences I attend. I don’t know whether the statistics on maths versus engineering are in line with this observation, but clearly maths is doing something right from which we could learn. Secondly, there were clear sessions devoted to community building: mentoring sessions, tutorials for new research students, SIAM student chapter presentations, early career panels, presentations on funding programmes, diversity and inclusion sessions, a session on helping people improve their CV, an explicit careers fair, etc. Partly this may simply reflect the size of the conference, but even so, this seems to be something SIAM does particularly well.

Highlights of FPGA 2019

FPGA 2019 - ICL Group photo
Current and Former Imperial Staff, Students, and Sabbatical Visitors at FPGA 2019

This week, I attended the ACM FPGA 2019 conference in Seaside (nr. Monterey), California, the annual premier ACM event on FPGAs and associated technology. I’ve been involved in this conference for many years, as author, TPC member, TPC and general chair, and now steering committee member. Fashions have come and gone over this time, including in the applications of FPGA technology, but the programme at FPGA is always interesting and high quality. This year particular thanks should go to Steve Neuendorffer for organising the conference programme and to Kia Bazargan in his role as General Chair.

Below, I summarise my personal highlights of the conference. These are by no means my view of the “best” papers – they are all good – but rather those that interested me the most.

Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity, a collaboration between Tsinghua, Beihang, Harbin Institute of Technology, and Microsoft Research, tackled the problem of ensuring that an inference implementation, when sparsified, gets sparsified in a way that leads to balanced load across the various memory banks. The idea is simple but effective, and leads to an interesting tradeoff between the quality of LSTM output and performance. I think it would be interesting to try to design a training method / regulariser that encourages this kind of structured sparsity in the first place.

Kees Vissers from Xilinx presented a keynote talk summarising their new Versal architecture, which the Imperial team had previously had the pleasure of hearing about from our alumnus Sam Bayliss. This is a really very different architecture to standard FPGA fare, and readers might well be interested in taking a look at Kees’s slides to learn more.

Vaughn Betz presented a paper from the University of Toronto, Math Doesn’t Have to be Hard: Logic Block Architectures to Enhance Low Precision Multiply-Accumulate on FPGAs. This work proposed a number of relatively minor tweaks to Intel FPGA architectures which might have a signifiant impact on low-precision MAC performance. Vaughn began by pointing out that in this application, very general LUTs often get wasted by being used as very simple gates – he gave the example of AND gates in partial product generation, and even as buffers. A number of architectural proposals were made to avoid this issue. I find this particularly interesting at the moment, because together with my PhD student Erwei Wang and others, I have proposed a new neural network architecture called LUTNet, motivated by exactly the same concern. However, our approach is the dual of that presented by Vaughn – we keep the FPGA architecture constant but modify the basic computations performed by the neural network to be more well-tuned to the underlying architecture. Expect a future blog post on our approach!

Lana Josipović presented the most recent work on the dynamically scheduled HLS tool from Paolo Ienne‘s group at EPFL, which they first presented at last year’s conference – see my blog post from last year. This time they have added speculative execution to their armoury. This is a very interesting line of work as HLS moves to encompass more and more complex algorithns, and Lana did a great job illustrating how it works.

Yi-Hsiang Lai presented HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing, an interesting collaboration between Zhiru Zhang‘s group at Cornell and Jason Cong‘s group. This work proposed separating functionality from implementation / optimisation concerns, such as datapath, precision and memory customisation, providing a cleaner level of abstraction. The approach seems very interesting, and reminded me of the aspect-oriented HLS work I contributed to in the REFLECT European project, about which Joāo Cardoso and others have since written a book. I think it’s a promising approach, and I’d be interested to explore the potential and challenges of their tool-flow. This paper won the best paper prize of the conference – congratulations to the authors!

My PhD student Jianyi Cheng presented our own paper, EASY: Efficient Arbiter SYnthesis from Multi-Threaded Code, and did an excellent job. Our paper is described in more detail in an earlier blog post.

Jianyi Cheng presenting our paper

Other papers I found particularly interesting include Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs, Microsemi’s contribution on analytic placement, ETH Zürich’s paper on an FPGA implementation of an approximate maximum graph matching algorithm, and U. Waterloo’s paper on a lightweight NoC making use of traffic injection regulation to avoid stalls. Unfortunately I had to miss the talks after noon on Tuesday, so there may well be more of interest in that part of the programme too.

The panel discussion – chaired by Deming Chen – was on the topic of whether FPGAs have a role to play in Supercomputing. As I pointed out in the discussion, to answer this question scientifically we need to have a working definition of “FPGA” and of “Supercomputing” – both seem to be on shifting sands at the moment, and we need to resist reducing a question like this to “does LINPACK run well on a Virtex or Stratix device.”

We also had the pleasure of congratulating Deming Chen and Paul Chow on their recently awarded fellowships, awarding a best paper prize, recognising several historical FPGA papers of significance, and last but by no means least welcoming the new baby of two of the stalwarts of the FPGA community – baby complete with “I am into FPGA” T-shirt! All this led to an excellent community feeling, which we should continue to nurture.



Efficient Memory via Formal Verification

My new PhD student Jianyi Cheng is presenting a very exciting paper at the ACM International Symposium on FPGAs (FPGA 2019). This is work he did for his Masters degree, and is a collaboration with Joy Chen and Jason Anderson at the University of Toronto, as well as Shane Fleming and myself at Imperial. In this blog post, I aim to summarise the main idea.

Multi-threaded programming is now a fairly mainstream activity, and has found its way into high-level synthesis tools, both through OpenCL and also LegUp pthreads support. We focus here on the latter.

At FPL 2017, Joy and Jason had a paper that automatically decided how to partition shared arrays for multi-threaded code, aiming to reduce the amount of arbitration required between hardware units and chunks of memory. Their approach used a simulation trace to identify candidate partitions, and designed the arbiters so that, for example, if accesses to partition P were only observed in that trace to come from thread T, then there is very low latency access to P from T at execution time. In this way, they were able to significantly speed up synthesised multi-threaded code making use of shared memories.

However, the arbiters were still there. They were necessary because while no access by some other thread T’ was observed during simulation, there was no guarantee that such an access might not occur at run-time. So the arbiters sat there, taking up FPGA area and – for large enough numbers of ports – hitting the critical path of the design.

Enter our work.

In our paper, we show – building on the excellent PhD thesis by Nathan Chong that I examined a few years back – how the original multi-threaded code can be translated into  single-threaded code in a verification language developed by Microsoft Research called Boogie. We then show how to automatically construct assertions in Boogie that, if passed, correspond to a formal proof that a particular thread can never access a particular partition. This lets us strip out the arbiters, gaining back the area and significantly boosting the clock frequency.

I think it’s a really neat approach. Please come and hear Jianyi give his talk and/or read the paper!


Neural Networks, Approximation and Hardware

My PhD student Erwei Wang, various collaborators and I have recently published a detailed survey article on this topic: Deep Neural Network Approximation for Custom Hardware: Where We’ve Been, Where We’re Going, to appear in ACM CSUR. In this post, I will informally explain my personal view of the role of approximation in supervised learning (classification), and how this links to the very active topic of DNN accelerator design in hardware.

We can think of a DNN as a graph G, where nodes perform computations and edges carry data. This graph can be interpreted (executed) as a function \llbracket G \rrbracket mapping input data to output data. The quality of this DNN is typically judged by a loss function \ell. Let’s think about the supervised learning case: we typically evaluate the DNN on a set of n test input data points x_i and their corresponding desired output y_i, and compute the mean loss:

L(G) = \frac{1}{n} \sum_{i=1}^n {\ell\left( \llbracket G \rrbracket(x_i), y_i \right)}

Now let’s think about approximation. We can define the approximation problem as – starting with G – coming up with a new graph G', such that G' can be somehow much more efficiently implemented than G, and yet L(G') is not significantly greater than L(G) – if at all. All the main methods for approximating NNs such as quantisation of activations and weights and sparsity – structured and unstructured – can be viewed in this way.

There are a couple of interesting differences here to the different problem – often studied in approximate computing, or lossy synthesis – of approximating the original function \llbracket G \rrbracket. In this latter setting, we can define a distance d(G',G) between G and G' (perhaps worst case or average case difference over the input data set), and our goal is to find a G' that keeps this distance bounded while improving the performance, power consumption, or area of the implementation. But in the deep learning setting, even the original network G is imperfect, i.e. L(G) > 0. In fact, we’re not really interested in keeping the distance between G and G' bounded – we’re actually interested bounding the distance between \llbracket G' \rrbracket and some oracle function defining the perfect classification behaviour. This means that there is a lot more room for approximation techniques. It also means that L(G') may even improve compared to L(G), as sometimes seen – for example – through the implicit regularisation behaviour of rounding error in quantised networks. Secondly, we don’t even have access to the oracle function, only to a sample (the training set.) These features combine to make the DNN setting an ideal playground for novel approximation techniques, and I expect to see many such ideas emerging over the next few years, driven by the push to embed deep learning into edge devices.

I hope that the paper we’ve just published in ACM CSUR serves as a useful reference point for where we are at the moment with techniques that simultaneously affect classification performance (accuracy / loss) and computational performance (energy, throughput, area). These are currently mainly based around quantisation of the datatypes in G (fixed point, binarisation, ternarisation, block floating point, etc.) topological changes to the network (pruning) and re-parametrisation of the network (weight sharing, low-rank factorisation, circulant matrices) as well as approximation of nonlinear activation functions. My view is that this is scratching the surface of the problem – expect to see many more developments in this area and consequent rapid changes in hardware architectures for neural networks!



The Growth Mindset

Over the last 5-10 years, the Growth Mindset has become a very popular feature of many schools across England. I have seen it implemented in a couple of schools, and I’m also aware that its initiator, Carol Dweck, gave an interview a couple of years ago where she criticised some implementations as “false growth mindset”.

In order to learn a bit more about the original research conducted by Dweck, I decided over the holiday to read her early book, ‘Self-theories: Their role in motivation, personality, and development’, Psychology Press, 1999. I have no background in psychology and a very limited background in educational theory, but I still want to know how much I can get from this as a parent, as an educator, and as a member of a school board.

As notes to myself, and for others who may be interested, I’m reporting the main take-away messages I got from the book in this post. I do not question the validity of any claims – I am not knowledgeable enough to do so – and I’m also very conscious that I have not had time to follow up the references to read the primary research literature. Instead, I cite below the chapters of the book in which the references can be found, should blog readers be interested in following up more deeply.

Two Theories of Intelligence

Dweck defines the seeking of challenge, the value of effort, and persistence in the face of obstacles as ‘mastery-oriented approaches’. She aims to knock down several ‘commonly held’ beliefs about what fosters such approaches: they are not more common in students with high ability, they are not necessarily improved by success in tasks, they are not improved by praise of students’ intelligence, and they are not even typically associated with students who have a high confidence in their intelligence. So what are the best approaches to fostering such qualities?

Dweck contrasts two theories of intelligence, which I’ve heard referred to in schools as “the fixed mindset” and “the growth mindset”. In the original research in this book, she refers to these as “The Theory of Fixed Intelligence” / “The Entity Theory” and “The Theory of Malleable Intelligence” / “The Incremental Theory”. In an experimental setting, failure is reported to motivate some students and demotivate others, in an apparently fairly bimodal distribution (Chapter 2).

To my mind, what’s missing from this discussion is a shared understanding of what intelligence actually is (Dweck picks this up much later in Chapter 9, on IQ tests). Intelligence, to me, describes the ability to learn and think – this seems to be a qualitative rather than a quantitative property. We could, of course, talk about speed or depth or some other quantification, and I’m aware that there’s a huge volume of work on this topic, about which I know little (any pointers for good books on this?) A principled definition of intelligence seems relevant because while I think nobody would say that a person’s knowledge is fixed, there is clearly a difference of opinion over the ability to gain such knowledge and skills – do people differ solely in the rate of development of knowledge / skills, or in the maximum level of knowledge / skills, or something else? And if there are such limits on the rate of change today for Person X, will those limits be different in the future for the same person? If the rate of change can change, can the rate of change of the rate of change change? And so, ad infinitum. And should we even care? Chapter 9 discusses pupils’ own views, with Dweck suggesting that entity theorists associate intelligence with inherent capacity or potential, while incremental theorists associate intelligence with knowledge, skills and effort. This actually surprised me – it seems that the perspective of the incremental theorists makes the very concept of intelligence – as distinct from knowledge, skills, and effort, superfluous. But it also seems to be somewhat inconsistent, because in Chapter 11 we learn that incremental theorists tend not to judge their classmates’ intelligence based on their performance in school. Perhaps the incremental theorists just have a hazier conception of intelligence in the first place?

What’s clear is that Dweck has no truck with those claiming that Growth Mindset means that “everyone can be an Einstein if you put in the effort” – it’s just that she strongly argues that potential cannot be readily measured based on current attainment – that there may well be undiscovered Einsteins in bottom set classes. These are not the same thing at all.

The Impact of Theories of Intelligence

Dweck then goes on to show that students’ theories of intelligence impact their choice of goals, with students holding the entity theory more likely to chose performance goals, given an option. She shows this to be a causal link, via appropriately designed experiments to temporarily alter students’ theories of intelligence.

Dweck shows that the goals given to students impact on whether they react with a “helpless” or a “mastery” response, even for the same task. Students given a “performance goal” are much more likely to produce a helpless response than those given a “learning goal”. Performance goals are fairly ubiquitous in the English education system, as individual target grades shared with pupils. I wonder whether her observation carries forward into this setting?

Dweck argues that pupils holding an entity model can sabotage their own attainment – withholding effort so that if they do poorly, they can blame their own lack of effort whereas if they do well, they feel validated in their innate intelligence (Chapter 6).

In Chapter 12, Dweck discusses pupils’ views of the belief in the potential to change and improve, and the impact of intelligence models on this belief – which plays out unsurprisingly. I’m more interested in similar beliefs held by teaching staff and how / whether they impact on their practice (does anyone know of any studies on this topic?)

One area where I found the book less precise is whether students can simultaneously be “more of an entity-theorist” in some subjects and “more of an incremental-theorist” in others. Often this was dealt with as if these were universal theories, but my limited experience suggests that students may, for example, hold largely incremental theories in sport while largely entity theories in maths. (Again, anyone know of studies on this topic?)

Changing Theories of Intelligence

So how do we change mindsets? One method Dweck refers to throughout, is to actually teach pupils about theories of intelligence. Another is to focus on the type of praise given: to emphasise an incremental model, praise successful strategies used on tasks they’ve clearly found challenging; quick correct answers should be responded to with apologies for wasting their time, and by setting more appropriate and challenging problems. This is subtly different advice to “praising only effort”, an approach I’ve seen some schools adopting when trying to apply the growth mindset. The best approach seems to be to ensure that challenge level is appropriate for each pupil, ensuring alignment between effort and outcome. Unfortunately, many primary schools in England are running in directly the opposite direction at the moment (see my blog post here); I do wonder what impact this is likely to have on the mindset of high-attaining pupils in the English education system.

In Chapter 15, Dweck looks at the kind of criticism and praise that reinforces these differing views. Criticism suggesting alternatives, e.g. “You’ve not quite done that completely. Maybe you should think of another way,” caused a reinforcement of incremental theories, whereas criticisms of the individual, e.g. “I’m disappointed in you”, tended to emphasise entity theories. More strikingly, Dweck argues strongly that positive praise targeted at inherent traits, e.g. “you’re smart!”, “you’re very good at this” or “I’m proud of you” can reinforce the entity theory, whereas praise such as “you’ve found a great way to do that – can you think of any other ways?” reinforces the incremental theory. While the former type of praise is definitely well received, and gives a temporary boost, Dweck argues that it sets pupils up for failure when they encounter difficulties and draw the inverse conclusion – “if I’ve not been successful, then I’m not smart, and you’re not proud of me”.

Finally, we only need to consider changing mindsets after mindsets are embedded. Dweck spends some space (Chapter 14) on arguing that the helpless-/mastery- dichotomy in responses is present even in 3.5-year-olds (where she associates this with a ‘theory of badness’ held by the children, rather than a ‘theory of intelligence’) so the mindset issue seems to be an issue for all phases of education.


Praise and Criticism. Students receive criticism and praise throughout their learning journey, and trying to change verbal feedback through training of staff is one thing to look at. However, it strikes me that one formalised arena for feedback, shared across parents, children and teachers, is in written “reports home”. I suspect it would be relatively easy to survey these reports for the type of language used, and compare this against the evidence Dweck presents on forms of criticism and praise. I’d be very interested in any schools that may have tried to survey or manage report language to align it with growth mindset principles. This also extends to grades: following Dweck’s results in Chapter 16 on “process praise”, it would seem far better to send home a report saying “worked on some great methods for X” rather than “Grade B”, or “could try alternative strategies for staying focussed” rather than “Grade C”.

Elective Remedial (Catch-up) Classes. Another interesting implication for schools and universities alike is the use of elective remedial classes. Several of Dweck’s studies seem to show that for those pupils who hold an entity theory of intelligence, it’s precisely those pupils who don’t need the remedial classes who are happy to attend them. Institutions should think about how to get around this problem.

School Transitions. There are implications for managing the transition from primary to secondary school, revealed by Dweck’s study of grade school to junior-high transition in the US; perhaps secondaries – jointly with primaries, even – could explicitly teach about theories of intelligence as part of the induction process, like the study at UC Berkeley reported in Chapter 5. I wonder whether any secondaries have tried this?

Mental Health. Mental health in educational settings is a hot topic at the moment. Given Dweck’s theories about self-esteem and its link to mindset, can recent work of schools and universities on mental health be improved by engaging with these ideas? For example, can mental health issues be avoided by trying to foster a growth mindset, and has any significant evidence been collected in this regard?

Grouping by attainment. I have seen many discussions of Growth Mindset that have suggested that grouping pupils by attainment runs counter to the principles outlined here. But interestingly, this is not what Dweck says (Chapter 17). She says that within the entity framework, this might be true, but attainment grouping within the incremental framework is not inherently problematic – it’s just an acknowledgement of fact. I would note that such groups are often referred to in education as “ability groups” rather than “attainment groups” – perhaps reflective of the entity theory. This issue potentially becomes even more acute when considering streaming and/or selective entry testing.

Gifted and Talented Programmes. There appear to be several implications for gifted and talented programmes (G&T) in schools (Dweck deals explicitly with this in Chapter 16, but does not draw out all the conclusions). Firstly, and essentially, we need to ensure all students are challenged, or they will not experience difficulty and effort; at the high-attaining end, this may or may not come from a G&T programme, depending on the pupil and the school approach to differentiation, but it cannot be absent. Secondly, perhaps the name G&T is problematic – Dweck herself says that “the term ‘gifted’ conjures up an entity theory,” and it’s not hard to imagine children in G&T programmes worrying more about losing G&T status than improving their knowledge and skills.

Teacher Mindsets. Although it would seem natural for teachers to have an incremental theory / growth mindset, my observations suggest this is not always the case. I wonder whether any schools have undertaken studies of their own teaching staff in this regard – this could be very interesting.

Beyond Intelligence

Chapter 10 shows that very similar observations apply to personal and social relationships, and Chapter 13 argues that theories of intelligence are also closely associated with the formation of stereotypes. Chapter 17 describes a link with self-esteem, and suggests that parents and teachers alike can model feeling good about effortful tasks, as a route to self-esteem within the incremental model. and that entity models are correlated with depression and anxiety (Chapter 7).

Overall, this book has given me plenty to think about as a parent, and a fair bit to think about as an educator too. I’d be really interested in hearing people’s suggestions for more reading on the topics above, especially if any of the studies I suggest above have already been done in the psychology or education literature.

Readers who enjoyed this post might be interested in my other educational posts.

Teaching Curriculum Beyond Year Group

I have lost track of the number of times that I’ve been told by parents of primary-age children in England that schools are claiming that they are “not allowed” to teach content beyond that set out for the child’s year group in the English National Curriculum, ever since the curriculum reforms in 2014.

This myth seems to be so embedded that I have heard it myself from numerous headteachers and teaching staff.

Instead of spending the time explaining the actual situation afresh each time I am asked, I have instead put it down as this brief explanatory blog post. I hope people find it helpful.

Firstly, different schools will have different policies. It may be school policy to do / not to do something with the curriculum, but this is determined by the school alone, acting in line with the statutory framework. For academies, the statutory framework is typically minimal. Maintained schools must follow the statutory National Curriculum, and – in practice – every academy I’ve come across also abides by these regulations.

Presumably, the myth started because the National Curriculum Programmes of Study [Maths, English] are set out as expectations by year group. However, the programmes very clearly state:

“Within each key stage, schools therefore have the flexibility to introduce content earlier or later than set out in the programme of study.”

“schools can introduce key stage content during an earlier key stage, if appropriate.”

(see Section “School Curriculum” in either the Maths or the English Programme of Study.)

This must be read in the context of the broader thrust of the programmes, which state:

The expectation is that the majority of pupils will move through the programmes of study at broadly the same pace. However, decisions about when to progress should always be based on the security of pupils’ understanding and their readiness to progress to the next stage. Pupils who grasp concepts rapidly should be challenged through being offered rich and sophisticated problems before any acceleration through new content. Those who are not sufficiently fluent with earlier material should consolidate their understanding, including through additional practice, before moving on.

So, put simply, schools can certainly teach children content above their year group. But only if they’re ready for it. Common sense, really.

If you really want to know more about my views on education, then please click on the “Education” link on this blog post to find related posts.

Approximation of Boolean Functions

Approximate Computing has been a buzzphrase for a while. The idea, generally, is to trade off quality of result / solution, for something else – performance, power consumption, silicon area. This is not a new topic, of course, because in numerical computation people have generally always worked with finite precision number representations. In my early work in 2001, before the phrase “Approximate Computing” was in circulation, I introduced this as “Lossy Synthesis” – the idea that circuit synthesis can be broadened to incorporate the automated control of loss of numerical quality in exchange for reduction in area and increase in performance.

Most approximate computing frameworks focus on domains where numerical error is tolerable. Perhaps we don’t care if our answer is 1% wrong, for example, or perhaps we don’t even care if it’s out by 100%, so long as that happens very infrequently.

However, there is another interesting class of computation. Consider a function producing a Boolean output f : \chi \to {\mathbb B}, where {\mathbb B} = \{T, F\}. An interesting challenge is to produce another function \tilde{f} : \chi \to {\mathbb T} with a ternary output {\mathbb T} = \{T, F, -\} bearing a close resemblance to f. We can make the idea of bearing a close resemblance precise in the following way: if \tilde{f} declares a value true (false), then so must f. We can think of this as relation between fibres:

\tilde{f}^{-1}(\{T\}) \subseteq f^{-1}(\{T\}) and \tilde{f}^{-1}(\{F\}) \subseteq f^{-1}(\{F\})            (1)

We can then think of the function \tilde{f} as approximating f if the fibre of the ‘don’t know’ element, -, is small in some sense, e.g. if |\tilde{f}^{-1}(\{-\})| is small.

In the context of approximate computing, we can pose the following optimisation problem:

\min_{\tilde{f}}: \mbox{Cost}(\tilde{f}) subject to |\tilde{f}^{-1}(\{-\})| < \tau and (1),

where \mbox{Cost} represents the cost (energy, area, latency) of implementing a function. One application area for this kind of investigation is in computer graphics. It is often the case that, when rendering a scene, an algorithm first needs to decide which components of the scene will definitely not be visible, and therefore need not be considered further. Should this part of the graphics pipeline make a mistake by deciding a component may be visible when it is actually invisible, little harm is done – more computation is required downstream in the graphics pipelining, costing energy and time, but not a reduced quality rendering. On the other hand, if it makes a mistake by deciding that a component is invisible when it is actually visible, this may cause a significant visual artefact in the rendered scene.

Last year, I had a bright Masters student, Georgios Chatzianastasiou, who decided to explore this problem in the context of f being the Slab Method in computer graphics and \tilde{f} being one of a family of approximations \tilde{f}_p, each produced by using interval arithmetic approximations to f computed in floating-point with precision p. In this way we get a family of approximate computing hardware IP blocks, all of which guarantee that, when given a ray and a bounding box, if the IP reports no intersection between the two, then there is provably no intersection. Yet each family member operates at a different precision, requiring different circuit area, trading off against the rate of `false positives’. Georgios wrote a paper on the implementation, which was accepted by FPL 2018 – he presents it next Wednesday.

If you’re at the FPL conference, please go and say hello to Georgios. If you’re interested in working with me to deepen and broaden the scope of this work, please get in touch!