When are Digits Correct?

Often, we compute with iterative algorithms: start with some value, often an initial guess to be refined, and keep iterating until some stopping criterion is met. If we actually think about what goes on in a modern digital computer when we execute these algorithms, we soon see that – often – the same digits end up being computed time and again. As we converge to a value, it’s reasonable to expect that, most of the time, the most significant digits become stable. But we still compute them, time and again at each iteration, wasting computational resources.

In general, in standard binary representations, this re-computation may not be avoidable – most-significant digits might be stable for 1000 iterations and then flip, e.g. from 0.99999 to 1.00000. As a child, I used to play with such iterations using my HP32S calculator – a gift from fred harris – it provided endless entertainment.
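To make the digit-flip issue concrete, here is a tiny sketch of my own (purely illustrative, not taken from any paper): a slowly converging fixed-point iteration whose limit is 1.0. Printed to five decimal places, the leading digits sit at 0.99999 for roughly a thousand iterations before flipping to 1.00000, so no prefix of the conventional representation is safe to freeze before convergence.

```python
# Illustrative only: the fixed point of x <- 0.999*x + 0.001 is 1.0.
# We print the iterate whenever its five-decimal-place rounding changes.
x = 0.5
shown = None
for k in range(20000):
    x = 0.999 * x + 0.001
    rounded = f"{x:.5f}"
    if rounded != shown:
        print(f"iteration {k:5d}: {rounded}")
        shown = rounded
# The last two values printed are 0.99999 and, about a thousand iterations
# later, 1.00000: the apparently stable leading digits all flip at once.
```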

There is, however, a class of number representations in which these digit flips can be avoided: redundant number representations. These representations have a long history – indeed, as my friend and colleague Miloš Ercegovac has identified, they can be traced back as far as a 1727 publication in Phil. Trans. Royal Soc by John Colson FRS. Miloš developed these ideas to a mature state between the 1970s and today, in the guise of Online Arithmetic.

Together with my PhD students He Li (now research staff at Cambridge) and Ian McInerney and collaborator James Davis, I have recently done some work on methods to detect and establish exactly when digits become stable using such schemes and what the implications might be for hardware that makes use of this stability. In our recent IEEE Transactions on Computers paper, we adapt standard forward error analyses of stationary iterative methods to this setting. We mathematically derive some conditions that can be checked at run-time to determine when you don’t need to compute certain digits in any future iteration, and also present a toy hardware implementation taking advantage of this approach using a non-standard arithmetic processor design.
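To give a flavour of the kind of run-time check involved, here is a minimal sketch in Python. It is emphatically not the condition derived in our paper: it simply runs a Jacobi iteration and applies the standard a posteriori contraction bound \|x_k - x^*\| \le \frac{q}{1-q}\|x_k - x_{k-1}\| (with q = \|M\| < 1) to report when the error has dropped below the weight of a given digit position; in a redundant representation, digits above that weight need never be recomputed.

```python
# A minimal sketch, NOT the run-time condition from the paper: Jacobi
# iteration x <- M x + c with contraction factor q = ||M||_inf < 1, using
# the standard bound ||x_k - x*|| <= q/(1-q) * ||x_k - x_{k-1}||.
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 5.0]])
b = np.array([1.0, 2.0])

D = np.diag(np.diag(A))          # Jacobi splitting A = D + R
R = A - D
M = -np.linalg.solve(D, R)
c = np.linalg.solve(D, b)
q = np.linalg.norm(M, ord=np.inf)
assert q < 1.0                   # this toy system is diagonally dominant

x_prev = np.zeros_like(b)
for k in range(1, 40):
    x = M @ x_prev + c
    err = q / (1.0 - q) * np.linalg.norm(x - x_prev, ord=np.inf)
    # Digit positions whose weight exceeds err can no longer be affected; in
    # a redundant (e.g. signed-digit) representation they need not be
    # recomputed in any later iteration.
    print(f"iteration {k:2d}: x = {np.round(x, 8)}, error bound = {err:.1e}")
    x_prev = x
```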

We hope that – in the future – only what needs to be computed will be computed.

Award of GCSEs and A-Levels in 2020

Readers of my blog who are based in England may know that due to COVID-19, GCSEs (typically taken at age 16) and A-Levels (age 18) are not going ahead as exams this year. Yesterday, the Office of Qualifications and Examinations Regulation (Ofqual) published a consultation on the methods to be used to ensure fairness in the award of these important qualifications. I intend to respond to this consultation, which is only open for two weeks, and have produced a draft response below. Before I submit it, I would welcome any feedback. Equally, others should feel free to borrow from my response if it helps them.

Centre Assessment Grades

To what extent do you agree or disagree that we should incorporate the requirement for exam boards to collect information from centres on centre assessment grades and their student rank order, in line with our published information document, into our exceptional regulatory requirements for this year?

Agree

To what extent do you agree or disagree that exam boards should only accept centre assessment grades and student rank orders from a centre when the Head of Centre or their nominated deputy has made a declaration as to their accuracy and integrity?

Strongly Agree

To what extent do you agree or disagree that Heads of Centre should not need to make a specific declaration in relation to Equalities Law?

Disagree

To what extent do you agree or disagree that students in year 10 and below who had been entered to complete exams this summer should be issued results on the same basis as students in year 11 and above?

Strongly Agree

To what extent do you agree or disagree that inappropriate disclosure of centre assessment judgements or rank order information should be investigated by exam boards as potential malpractice?

Neither Agree nor Disagree

Do you have any comments about our proposals for centre assessment grades?

  1. While a separate Equalities Law declaration is not necessary, the Head of Centre should be able to confirm, as part of their overall declaration, that they have taken equalities law into consideration.
  2. Ofqual should liaise with the National Governance Association and with teaching unions to provide guidance to governing bodies and staff on appropriate challenge and support to schools in order to ensure processes underlying Head of Centre declaration are appropriately evidenced.
  3. While I understand and support the motivation for labelling inappropriate disclosure of centre assessments as malpractice, care must be taken and guidance given to centres over what is deemed “inappropriate”. I would not want to be in the situation where a teacher is unable to calm a student in a way they normally would, for example by telling them that “I can’t see any way you won’t get a Grade 7”. There may be an equalities implication for those students suffering from extreme anxiety, and this should be considered when drawing up guidance for centres.
  4. While I accept that there is little time to provide detailed guidance for centres to follow when drawing up rank-order lists, the publication of examples of good practice may help centres, and I would recommend this is considered.

Issuing Results

To what extent do you agree or disagree that we should incorporate into the regulatory framework a requirement for all exam boards to issue results in the same way this summer, in accordance with the approach we will finalise after this consultation, and not by any other means?

Strongly Agree

Do you have any comments about our proposal for the issuing of results?

None

Impact on Students

To what extent do you agree or disagree that we should only allow exam boards to issue results for private candidates for whom a Head of Centre considers that centre assessment grades and a place in a rank order can properly be submitted?

Agree

To what extent do you agree or disagree that the arrangements we put in place to secure the issue of results this summer should extend to students in the rest of the UK?

Strongly agree

To what extent do you agree or disagree that the arrangements we put in place to secure the issue of results this summer should extend to all students, wherever they are taking the qualifications?

Neither agree nor disagree

Do you have any comments about the impact of our proposals on any particular groups of students?

  1. Unfortunately, I see no other option than that proposed for private candidates. However, I would urge that the definition of “properly” in the criterion given be made much more explicit, and in objective terms, to Heads of Centre.
  2. I suggest legal advice be sought on the enforceability of arrangements within centres outside the UK – in particular on the implications of a breach of a Head of Centre’s declaration – before such centres are treated in the same way as those within the UK.
  3. I am concerned over the impact of the proposed arrangements for some groups of students who may be differentially affected by the change in routine due to lockdown, e.g. those with Autistic Spectrum Conditions (ASC). In order to be as fair as possible to these students, I suggest that explicit guidance be given to centres emphasising that centres are free to disregard any dip in attainment since lockdown when coming up with their rank-order list, and again emphasising their duties under equalities legislation.

Statistical standardisation of centre assessment grades

To what extent do you agree or disagree with the aims outlined above?

Agree

To what extent do you agree or disagree that using an approach to statistical standardisation which emphasises historical evidence of centre performance given the prior attainment of students is likely to be fairest for all students?

Agree

To what extent do you agree or disagree that the trajectory of centres’ results should NOT be included in the statistical standardisation process?

Agree

To what extent do you agree or disagree that the individual rank orders provided by centres should NOT be modified to account for bias regarding different students according to their particular protected characteristics or their socio-economic backgrounds?

Agree

To what extent do you agree or disagree that we should incorporate the standardisation approach into our regulatory framework?

Agree

Do you have any comments about our proposals for the statistical standardisation of centre assessment grades?

  1. I am unclear from the consultation on whether standardisation is to occur on an exam-board basis or across exam boards. If it is on an exam-board basis, it is not clear what will happen when centres have changed exam board over the time window used to judge prior grade distribution at the school, especially if the change is for the first time this year.
  2. I have several statistical concerns over the proposed methodology, given the level of detail discussed so far. In particular,
    (i) there is a recognition that small centres or small cohorts will be difficult to deal with – this is a significant issue, and may be exacerbated depending on the definition of cohort (see #3, below), leading to significant statistical uncertainty;
    (ii) it is hugely important to avoid 2020 results being affected by outlier results in previous years. One possibility is to use median results from the previous three years – I would avoid using mean results or a single year’s results.
    Given these concerns, my view is that it would be more appropriate to award a “grade range” to students (e.g. “9-7”, which may of course include degenerate ranges like just “7”). This allows statistical uncertainty arising from the various measures integrated into the standardisation algorithm to be explicitly quantified, providing a transparent per-pupil result. It would allow universities and sixth-forms to decide for themselves whether to admit a pupil optimistically, pessimistically or on the basis of the interval midpoint.
  3. It is unclear from the consultation whether the past grade distributions used will be on a per-subject basis. If not, this is likely to violate proposed Aim 1 of the standardisation process. However, if so, this is likely to result in some very small cohorts for optional subjects at particular centres, so extreme statistical care must be taken in using these cohorts as the basis for grading in 2020. A possible solution is to produce grade ranges, as above.
  4. From a statistical perspective, estimation of grade distributions at a per-centre level (rather than estimation of mean grade, for example) is fraught with danger and highly sensitive to cohort size. It is very important that you do not consider the empirical frequency distribution of grades in a centre over the last one, two or three years as the underlying probability distribution, but rather as a sample from that distribution, using an appropriate statistical method to estimate the distribution from the sample. Such methods would also allow the incorporation of notions of variance, which could be factored into the “grade ranges” for students, explained in #2. As an extreme example: if a centre had no Grade 6s last year, only 5s and 7s, surely we should not bias our model towards there being no Grade 6s this year.
  5. There is an additional option for standardisation, not considered in the consultation document, which is less subject to the statistical problems of distribution estimation. You could extract just one or two parameters from your model (e.g. desired mean, desired standard deviation) and use these to normalise the distribution from each centre, rather than fit the complete distributions (a small illustrative sketch follows this list). Such aggregate statistics will be less susceptible to variation, especially for smaller cohorts.
  6. I am unclear how it is possible to award grades to students at centres without any historical outcomes and with no prior attainment data or prior attainment data covering a statistically-insignificant portion of the cohort. For these centres, some form of moderation or relying on Autumn term exam results may be required.
  7. I am concerned by the statement in the consultation that “we will evaluate the optimal span of historical centre outcomes (one, 2 or 3 years). We will select the approach that is likely to be the most accurate in standardising students’ grades.” There is no discussion of how “most accurate” can be judged; there is no data upon which to make this decision, so I would urge caution and an outlier-rejection strategy (see #2 above).
  8. While I broadly agree that there is insufficient data upon which to base rank-order modification based on protected characteristics or socio-economic backgrounds, of the three approaches discussed in the consultation document, the “second approach” is currently very vague and needs further refinement before I can offer an opinion on it. I am happy to be contacted for further comment on this in the future.
  9. I am concerned by the absence of a mechanism to flag unusual rank order differences between subjects in a centre. It should be possible to identify, for example, pupils ranked very high in Subject A and very low in Subject B compared to the typical centile differences in rankings between these subjects, for further investigation by the exam boards. The sensitivity of such a test could be set at a level appropriate to the amount of staff time available to investigate.
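To illustrate the aggregate-statistics idea in point 5, here is a small sketch; it is purely illustrative and in no sense a proposal for Ofqual’s actual algorithm. The target mean and standard deviation are hypothetical numbers standing in for whatever the historical model produces.

```python
# Illustrative only: shift and scale a centre's submitted numerical grades so
# that the cohort matches a target mean and standard deviation, rather than
# fitting a full historical grade distribution.
import statistics

def standardise(centre_grades, target_mean, target_sd):
    mean = statistics.mean(centre_grades)
    sd = statistics.pstdev(centre_grades) or 1.0   # guard against uniform cohorts
    return [target_mean + target_sd * (g - mean) / sd for g in centre_grades]

submitted = [4, 5, 5, 6, 7, 7, 8, 9]               # hypothetical centre assessment grades
print([round(g) for g in standardise(submitted, target_mean=5.8, target_sd=1.2)])
```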


Appealing the results

To what extent do you agree or disagree that we should not provide for a review or appeals process premised on scrutiny of the professional judgements on which a centre’s assessment grades are determined?

Agree

To what extent do you agree or disagree that we should not provide for a student to challenge their position in a centre’s rank order?

Agree

To what extent do you agree or disagree that we should not provide for an appeal in respect of the process or procedure used by a centre?

Strongly disagree

To what extent do you agree or disagree that we should provide for a centre to appeal to an exam board on the grounds that the exam board used the wrong data when calculating a grade, and/or incorrectly allocated or communicated the grades calculated?

Strongly Agree

To what extent do you agree or disagree that for results issued this summer, exam boards should only consider appeals submitted by centres and not those submitted by individual students?

Strongly disagree

To what extent do you agree or disagree that we should not require an exam board to ensure consent has been obtained from all students who might be affected by the outcome of an appeal before that appeal is considered?

Agree

To what extent do you agree or disagree that exam boards should not put down grades of other students as a result of an appeal submitted on behalf of another student?

Strongly agree

To what extent do you agree or disagree that exam boards should be permitted to ask persons who were involved in the calculation of results to be involved in the evaluation of appeals in relation to those results?

Disagree

To what extent do you agree or disagree that exam boards should be able to run a simplified appeals process?

Neither agree nor disagree

To what extent do you agree or disagree that we should not provide for appeals in respect of the operation or outcome of the statistical standardisation model?

Strongly agree

To what extent do you agree or disagree with our proposal to make the Exam Procedures Review Service (EPRS) available to centres for results issued this summer?

Strongly agree

Do you have any comments about our proposals for appealing results?

  1. I disagree with the absence of an appeal procedure against centre procedure. While recognising the difficulties faced by centres and the exceptional circumstances, there is an element of natural justice that must be maintained. Without such an appeal process, there is no safeguard against centres using completely inappropriate mechanisms to derive grade and rank orders, beyond the signed statement from the head of centre. While the consultation suggests that detailed guidance will not be sent to centres on the procedures they should follow, it is reasonable to expect a centre – if challenged by a sufficient number of candidates – to explain the procedure they did follow, and for an appeal body to find this to be reasonable or unreasonable in the circumstances. The outcome of any successful appeal may have to be the cancelling of all grades in a certain subject at a certain centre, requiring a fall-back to the Autumn 2020 exams, but the mere existence of such a mechanism may help focus centres on ensuring justifiable procedures are in place.
  2. The consultation document leaves open the question of what role staff of exam boards who were involved in the calculation of results would have in appeals. It appears proper for them to be involved in providing evidence to an independent appeals committee, but not to form such a committee.


An Autumn exam series

To what extent do you agree or disagree that entries to the autumn series should be limited to those who were entered for the summer series, or those who the exam board believes have made a compelling case about their intention to have entered for the summer series (as well as to students who would normally be permitted to take GCSEs in English language and mathematics in November)?

Agree

To which qualifications the emergency regulations will apply

To what extent do you agree or disagree that we should apply the same provisions as GCSE, AS and A level qualifications to all Extended Project Qualifications and to the Advanced Extension Award qualification?

Strongly agree

Do you have any comments about the qualifications to which the exceptional regulatory measures will apply?

None

Building the arrangements into our regulatory framework

To what extent do you agree or disagree that we should confirm that exam boards will not be permitted to offer opportunities for students to take exams in May and June 2020?

Disagree

To what extent do you agree or disagree with our proposals that exam boards will not be permitted to offer exams for the AEA qualification or to moderate Extended Project Qualifications this summer?

Disagree

Do you have any comments about our proposals for building our arrangements into our regulatory framework?

I have sympathy with the proposals in this section, but they need to be balanced against the harm done to those candidates who will be unable to use centre-based assessments and against Ofqual’s duties under the Equalities legislation, given that this may disproportionately affect disabled students (see pp.51-52 of the consultation document.) On balance, it may be better to leave this as an operational decision between exam boards and exam centres to allow exams in May and June, if possible, only for these students.


Equality impact assessment

Are there other potential equality impacts that we have not explored? What are they?

As previously noted, I am concerned over the impact of the proposed arrangements for some groups of students who may be differentially affected by the change in routine due to lockdown, e.g. those with Autistic Spectrum Conditions (ASC). In order to be as fair as possible to these students, I suggest that explicit guidance be given to centres emphasising that centres are free to disregard any dip in attainment since lockdown when coming up with their rank-order list, and again emphasising their duties under equalities legislation.

We would welcome your views on how any potential negative impacts on particular groups of students could be mitigated:

If Ofqual were to adopt a “grade range” approach, outlined above, then the quoted research into the reliability of predicted GCSE, AS and A-levels prior to 2015 could be used to inform the degree of uncertainty in the range, mitigating the impact on particular groups of students.

Learning in Vienna


I was delighted to be asked by Axel Jantsch to speak recently at the opening of the new Christian Doppler Laboratory (CDL) for Embedded Machine Learning. This is a new collaborative laboratory between TU Wien (led by Jantsch) and TU Graz (led by Bischof), together with several strong industrial partners from Germany and Austria. Its scope and content are very closely aligned with those of the EPSRC Center for Spatial Computational Learning, which I launched last November and which brings together Imperial and Southampton in the UK with Toronto in Canada, UCLA in the USA, and – just announced – Sydney in Australia. I was therefore delighted to bring fraternal greetings from our centre, and to begin discussions over how we could work together in the future to build a truly global collaborative research effort in this space.

In addition to hearing about the work plan and objectives for the new CDL, the meeting heard from three external academics (Christoph Lampert, Bernt Schiele and myself) on their recent relevant research. I spoke about the work I recently published in the Philosophical Transactions of the Royal Society. I found both Christoph’s and Bernt’s talks inspiring – I briefly summarise the key points of interest to me below.

Christoph Lampert from IST spoke on Embedded and Adaptive Models for Visual Scene Understanding. He explained that while others are working on more efficient hardware and more efficient ML models, his group is focusing on matching models to problems and making models adaptive to the environment in which they’re applied. With this in mind, he focused on two recent papers, one from WACV 2020 and one from ICCV 2019. The WACV paper is on object detection, proposing a compute-efficient method that avoids the twin pitfalls of proposal-based systems like Faster R-CNN and grid-based methods like YOLO. The ICCV paper is on multi-exit architectures, where latency constraints can cause early termination of deep neural network inference and we still want a useful result in these circumstances. The paper discusses how to train such networks using ideas from Hinton et al.’s “Knowledge Distillation”. This also reminded me of some work from our research group [1,2] featuring early exits or a focus on getting good results quickly in latency-constrained systems.

Bernt Schiele from MPI and Saarland University spoke on Bright and Dark Sides of Computer Vision and Machine Learning. He spoke about several very interesting topics, including scene context (e.g. ‘Not Using the Car to See the Sidewalk’, CVPR 2019), Disentangling Adversarial Robustness and Generalization (CVPR 2019) and reverse engineering and stealing deep models (e.g. ‘Knock-off Nets’, CVPR 2019). All are very interesting and timely topics, but the work on disentangling adversarial robustness and generalisation was particularly interesting to me, since I’ve given the topic of generalisation some thought in the context of efficient DNN accelerator hardware. Schiele argued for a stronger distinction to be made in the research community between different classes of adversarial examples. He focused on the idea of “on-manifold adversarial examples”, where an adversarial example needs to actually be a correct instance of the class rather than an arbitrary perturbation of an image – the latter, commonly used in the literature, is what Schiele referred to as a ‘regular’ adversarial example. His talk showed how on-manifold examples could be studied. The main take-home messages were that ‘regular’ robustness and generalisation are not contradictory, that ‘on-manifold’ adversarial examples can be found, and that robustness in this on-manifold sense is a form of generalisation.

Highlights from FPGA 2020

Reblogged from John's Blog:

Here are a few personal highlights from the FPGA 2020 conference, which took place this week in Seaside, California. (Main photo credit: George Constantinides.)

Jakub Szefer’s invited talk on “Thermal and Voltage Side Channels and Attacks in Cloud FPGAs” described a rather nifty side-channel through which secrets could be leaked in the context of cloud-based FPGAs. The context is that one user of an FPGA would like to transmit information secretly to the next user of the same FPGA. Jakub suggests transmitting this information via the physical temperature of the FPGA. His proposal is that if the first user would like to transmit a ‘1’, they heat up the FPGA by running a circuit that uses a lot of current, such as a ring oscillator; if they would like to transmit a ‘0’, they don’t. The second user, once they arrive, measures the temperature of…

(The excerpt above continues in the original post – 1,412 more words.)

When to Schedule?

On Tuesday, Jianyi Cheng will present our recent work Combining Dynamic and Static Scheduling in High-level Synthesis at the ACM International Symposium on FPGAs in Monterey. This is joint work between Jianyi and his two supervisors, John Wickerson and myself, as well as our collaborators from EPFL, Lana Josipović and her PhD supervisor Paolo Ienne.

As I’ve described in previous blog posts [1,2], Lana has been doing interesting work over the last few years, developing a tool flow for dynamically-scheduled high-level synthesis (HLS). Typically in modern HLS tools like VivadoHLS or LegUp, scheduling decisions are made statically – at compile time. However, Lana’s tool flow makes these decisions dynamically, at run time, using handshaking circuitry, reminiscent of Page and Luk’s work compiling occam into FPGAs.

In our paper, we have teamed up with EPFL to build a flow that can result in the best of both worlds. Static scheduling can be very efficient, sharing resources and leading to low-area designs. Dynamic scheduling can be very fast, working around actual rather than potential data dependencies. Jianyi’s paper allows the definition of statically scheduled functions within a dynamically scheduled program. He shows that over a range of benchmarks, the results are about half the area of the fully dynamically-scheduled designs while running about 1.7x faster than the fully statically-scheduled designs.


Arithmetic for Neural Networks

Last month, the Royal Society Phil Trans A published my paper Rethinking Arithmetic for Deep Neural Networks, part of a special issue put together by Jack Dongarra, Laura Grigori and Nick Higham. In this blog post, I aim to briefly introduce the ideas in the paper. The paper is open access, so please read it for further detail. In addition, my slides from a talk given on an early version of this work are available from Nick Higham’s blog, and an mp3 recording of me talking to these slides has been made available by the Royal Society here.

The central theme of the paper is that hardware accelerator circuits for neural networks can actually be treated as neural networks. Consider the two graphs below. One of them represents a simple deep neural network where each node performs an inner product and a ReLU operation. The other represents the result of (i) deciding to use 4-bit fixed-point arithmetic, and then (ii) synthesising the resulting network into a circuit, technology-mapped to 2-input Boolean gates.

Although there are obvious differences (in structure and in number of nodes), there is a core commonality: a computation described as a graph of parameterisable functional nodes.
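As a toy illustration of that commonality (my own example, not a figure from the paper), take a single neuron quantised to 4-bit inputs and outputs. Once quantised, the neuron is literally a Boolean function of eight input bits, which is exactly the object a logic synthesiser would technology-map to a network of 2-input gates:

```python
# A 4-bit-quantised neuron (inner product + ReLU) enumerated as a Boolean
# function: illustrative toy only, with made-up weights.
from itertools import product

W = [3, -2]          # fixed toy weights

def neuron(x0, x1):  # x0, x1 are unsigned 4-bit activations in [0, 15]
    acc = W[0] * x0 + W[1] * x1
    return max(acc, 0) & 0xF      # ReLU, then keep the low 4 output bits

truth_table = {}
for x0, x1 in product(range(16), repeat=2):
    bits_in = (x0 << 4) | x1                  # 8 Boolean inputs
    truth_table[bits_in] = neuron(x0, x1)     # 4 Boolean outputs

print(len(truth_table), "rows; e.g. inputs (5, 3) ->", truth_table[(5 << 4) | 3])
```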

So what can we gain from this perspective?

1. Binarized Neural Networks are universal. The paper proves that any network you want to compute can be computed using binarized neural networks with zero loss in accuracy. It’s simply not the case that some problems need high precision. But, as a corollary, it is necessary to not be tied too closely to the original network topology if you want to be guaranteed not to lose accuracy.

2. Boolean topologies are tricky things. So if we want to rethink topologies, what first principles should we use to do so? In the paper, I suggest looking to inspiration from the theory of metric spaces as one step towards producing networks that generalise well. Topology, node functionality, and input / output encoding interact in subtle, interesting, and under-explored ways.

3. This viewpoint pays practical dividends. My PhD student Erwei Wang, our collaborators James Davis and Peter Cheung, and I have developed a neural network flow called LUTNet, which uses Boolean lookup tables as the main computational nodes in a neural network, leading to very efficient FPGA implementations.

I’m hugely excited by where this work could go, as well as the breadth of the fields it draws on for inspiration. Please do get in touch if you would like to collaborate on any of the open questions in this paper, or any other topic inspired by this work.


Book Review: Category Theory for the Sciences

I have been reading Spivak’s book ‘Category Theory for the Sciences’ on-and-off for the last 18 months during vacations as some of my ‘holiday reading.’ I’ve wanted to learn something about category theory for a while, mainly because I feel it’s something that looks fun and different enough from my day-to-day activity not to feel like work. I found Spivak’s book on one of my rather-too-regular bookshop trips a couple of years ago, and since it seemed like a more applied introduction, I thought I’d give it a try. This blog post combines my views on the strengths of this book from the perspective of a mathematically-minded engineer, together with a set of notes on the topic which might be useful for me – and maybe other readers – to refer back to in the future.

Category Theory by the Pool, Summer 2019

The book has a well-defined structure. The first few chapters cover topics that should be familiar to applied mathematicians and engineers: sets, monoids, graphs, orders, etc., but often presented in an unusual way. This sets us up for the middle section of the book which presents basic category theory, showing how the structures previously discussed can be viewed as special cases within a more general framework. The final chapters of the book discuss some topics like monads that can then be explored using the machinery of this framework.

Set

Chapter 3 covers the category \mathbf{Set} of sets.

One of the definitions here that’s less common in engineering is that of coproduct. The coproduct of sets X and Y, denoted X \sqcup Y, is defined as the disjoint union of X and Y. I first came across disjoint union as a concept when doing some earlier holiday reading on topology. Here we also get our first experience of the author’s use of informal ‘slogans’ to capture the ideas present in more formal definitions and theorems: “any time behavior is determined by cases, there is a coproduct involved.” I really like this approach – it’s the kind of marginalia I usually write in books myself, but pre-digested by the author!

The notation \text{Hom}_{\mathbf{Set}}(X,Y) is used for the set of functions from X to Y, which we are encouraged to refer to as morphisms, all without really explaining why at this stage – it of course all becomes clear later in the book.

To give a flavour of the kind of exercises provided, we are asked to prove an isomorphism \text{Hom}_{\text{\bf Set}}(X \sqcup Y, A) \xrightarrow{\cong} \text{Hom}_{\text{\bf Set}}(X, A) \times \text{Hom}_{\text{\bf Set}}(Y, A). This reminds me of digital circuit construction: the former corresponds to a time-multiplexed circuit capable of producing outputs from A with inputs from either X or Y, together with an identifier for which (to feed a multiplexer select line); the latter corresponds to two circuits producing both the outputs rather than just the selected one.
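Here is a concrete gloss of my own (not the book’s) on that exercise, tagging elements of the disjoint union with their origin and exhibiting one direction of the bijection and its inverse:

```python
# Represent X ⊔ Y by tagging elements with their origin; then a function
# X ⊔ Y -> A and a pair of functions (X -> A, Y -> A) determine one another.
X, Y, A = {0, 1}, {"a"}, {"lo", "hi"}

def split(f):
    """f : X ⊔ Y -> A   |->   (f ∘ ι_X, f ∘ ι_Y), returned as two dicts."""
    return {x: f(("X", x)) for x in X}, {y: f(("Y", y)) for y in Y}

def join(g, h):
    """(g, h)  |->  the function on X ⊔ Y defined by cases (the 'multiplexer')."""
    return lambda t: g[t[1]] if t[0] == "X" else h[t[1]]

f = lambda t: "hi" if t[0] == "X" else "lo"
g, h = split(f)
assert all(join(g, h)(t) == f(t) for t in [("X", 0), ("X", 1), ("Y", "a")])
```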

The fibre product, X \times_Z Y, given functions f : X \to Z and g : Y \to Z is defined as the set X \times_Z Y = \{ (x, z, y) | f(x) = z = g(y)\}. Any set isomorphic to this fibre product is called a pullback of the diagram X \xrightarrow{f} Z \xleftarrow{g} Y.
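A small worked example of my own (not one of the book’s): take Z = \{0, 1, 2\} and let f and g both be reduction modulo 3; the fibre product then collects exactly the pairs that agree in Z.

```python
# Toy fibre product: f and g are reduction mod 3 into Z = {0, 1, 2}.
X, Y, Z = range(6), range(9), range(3)
f = lambda x: x % 3
g = lambda y: y % 3
fibre_product = {(x, z, y) for x in X for y in Y for z in Z if f(x) == z == g(y)}
print(len(fibre_product))   # 18 triples: 2 choices of x and 3 choices of y per z
```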

The fibre sum, X \sqcup_W Y, given functions f : W \to X and g : W \to Y is the quotient of X \sqcup W \sqcup Y by the equivalence relation \sim generated by w \sim f(w) and w \sim g(w) for all w \in W. A pushout of X and Y over W is any set Z which is isomorphic to X \sqcup_W Y.

A good intuitive description in the text is to think of the fibre sum of X and Y as gluing the two sets together, with W as the glue. A nice example from one of the exercises is W = \mathbb{N}, X = \mathbb{Z}, Y a singleton set, f : W \to X defined by w \mapsto -(w+1). Then we’re left with X \sqcup_W Y \cong \{ -, 0, 1, \ldots \}, where - is an element corresponding to the collapsing of all negative elements in X to a single point.

The familiar notions of injective and surjective functions are recast as monomorphisms and epimorphisms, respectively. A function f : X \to Y is a monomorphism if for all sets A and pairs of functions g, g' : A \to X, if fg = fg' then g = g'. A function f : X \to Y is an epimorphism if for all sets B and pairs of functions h, h' : Y \to B, if hf = h'f then h = h'. By this time in the book, we get the hint that we’re learning this new language as it will generalise later beyond functions from sets to sets – the chapter ends with various slightly more structured concepts, like set-indexed functions, where the idea is always to define the object and the type of morphism that might make sense for that object.

Monoids, Groups, Graphs, Orders & Databases

In Chapter 4, the reader is introduced to several other structures. With the exception of the databases part, this was familiar material to me, but with slightly unfamiliar presentation.

We learn the classical definition of a monoid and group, and also discuss the free monoid generated by a set X as being the monoid consisting of the set \mbox{List}(X) of lists of elements of X, with the empty list as the identity and list concatenation as the operation. We discuss the idea of a monoid action on a set S, i.e. a function \circlearrowleft: M \times S \to S associated with the monoid M having the properties e \circlearrowleft s = s for any s \in S, where e is the identity, and m \circlearrowleft (n \circlearrowleft s) = (m \star n) \circlearrowleft s, where m and n are any elements of M, s is any element of S, and \star is the monoid operation. The discussion of monoids is nice and detailed, and I enjoyed the exercises, especially those leading up to one of the author’s slogans: “A finite state machine is an action of a free monoid on a finite set.”
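The slogan is easy to make concrete; here is my own sketch (not code from the book) of a two-state parity machine, in which the free monoid \mbox{List}(X) on X = \{0, 1\} acts on the state set, and acting by a concatenation of words agrees with acting by the words one after the other:

```python
# A finite state machine as an action of a free monoid on a finite set.
S = {"even", "odd"}

def step(state, symbol):            # the action of a single generator in X = {0, 1}
    if symbol == 0:
        return state
    return "odd" if state == "even" else "even"

def act(word, state):               # extend to List(X) by acting letter by letter
    for symbol in word:
        state = step(state, symbol)
    return state

assert act([], "even") == "even"                                    # the empty list acts as the identity
assert act([1, 0] + [1], "even") == act([1], act([1, 0], "even"))   # compatible with concatenation
```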

Graphs are dealt with in a slightly unusual way (for me) as quadruples G := (V, A, \mbox{src}, \mbox{tgt}) of a set of vertices V, a set of arrows A, and functions \mbox{src}, \mbox{tgt} mapping each arrow to its source and target vertex, respectively. This then corresponds to what I would call a multigraph, as parallel edges can exist, though it is called a graph in the book.

For each of the structures discussed, we finish by looking at their morphisms.

As an example, let {\mathcal M} := (M, e, \star) and {\mathcal M'} := (M', e', \star') be monoids (the tuple is of the set, identity element, and operation). Then a monoid homomorphism f: {\mathcal M} \to {\mathcal M'} is a function f: M \to M' preserving identity and operation, i.e. (i) f(e) = e' and (ii) f(m_1 \star m_2) = f(m_1) \star' f(m_2) for all m_1, m_2 \in M.

Similarly, for graphs, graph homomorphisms f: G \to G' for G := (V, A, \mbox{src}, \mbox{tgt}) and G' := (V', A', \mbox{src}', \mbox{tgt}') are defined as pairs of functions f_0: V \to V' and f_1: A \to A' preserving the endpoints of arrows, i.e. such that \mbox{src}' \circ f_1 = f_0 \circ \mbox{src} and \mbox{tgt}' \circ f_1 = f_0 \circ \mbox{tgt} (the two evident squares commute).

Order morphisms are similarly defined, this time as those preserving order.

Thus we are introduced to several different (somewhat) familiar mathematical structures, and in each case we’re learning about how to describe morphisms on these structures that somehow preserve the structure.

Categories, Functors, and Natural Transformations

Things really heat up in Chapter 5, where we actually begin to cover the meat of category theory. This is exciting and kept me hooked by the pool. The unusual presentation of certain aspects in previous chapters (e.g. the graphs) all made sense during this chapter.

A category \mathcal{C} is defined as a collection of various things: a set of objects \text{Ob}(\mathcal{C}), a set of morphisms \text{Hom}_{\mathcal{C}}(x,y) for every pair of objects x and y, a specified identity morphism \text{id}_x for every object x, and compositions \circ: \text{Hom}_{\mathcal{C}}(y,z) \times \text{Hom}_{\mathcal{C}}(x,y) \to \text{Hom}_{\mathcal{C}}(x,z). To qualify as a category, such things must satisfy identity laws and also associativity of composition.

Examples of the category of Monoids, of Groups, of Sets, of Graphs, of Pre-Orders, etc., are explored. Isomorphisms between objects in a category are carefully defined.

Functors are introduced as transformations from one category to another, via a function transforming objects and a function transforming morphisms, such that identities and composition are preserved. Functors are then used to rigorously explore the connections between preorders, graphs, and categories, and I finally understood why the author had focused so much on preorders in the previous chapters!

Things started to take a surprising and exciting turn for me when the author shows, via the use of functors, that a monoid is a category with one object (the set of the monoid becomes the morphisms in the category), and that groups can therefore be understood as categories with one object for which each morphism is an isomorphism. This leads to a playful discussion, out of which pops groupoids, amongst other things. Similar gymnastics lead to preorders reconstituted as categories in which hom-sets have one or fewer elements, and graphs reconstituted as functors from a certain two-object category to \mathbf{Set}. Propositional logic is re-examined from a categorical perspective, though I suspect this is just scratching the surface of the interplay between category theory and logic, and I’d love to read more about this. For me, this was a highlight – that by playing with these definitions and trying things out, all these wonderful familiar structures arise.

The final part of this chapter deals with natural transformations. Just as functors transform categories to categories, natural transformations transform functors to functors. Specifically, a natural transformation between functors F : {\mathcal C} \to {\mathcal D} and G : {\mathcal C} \to {\mathcal D} is defined as a morphism (in {\mathcal D}) \alpha_X for each object X in {\mathcal C}, such that the naturality square G(f) \circ \alpha_X = \alpha_Y \circ F(f) commutes for every morphism f : X \to Y in {\mathcal C}.


I think this section on natural transformations is where the huge number of exercises in the book really came into its own; I doubt very much I would’ve grasped some of the subtleties of natural transformations without them.

We learn about two kinds of composition of natural transformations. The first combines two natural transformations on functors F,G,H : {\mathcal C} \to {\mathcal D}, say one \alpha: F \to G and one \beta: G \to H to obtain a natural transformation \beta \circ \alpha: F \to H. The other combines a natural transformation \gamma_1 between functors F_1, F_2 : {\mathcal C} \to {\mathcal D} with one \gamma_2 between functors G_1, G_2 : {\mathcal D} \to {\mathcal E}, to produce one, \gamma_2 \Diamond \gamma_1: G_1 \circ F_1 \to G_2 \circ F_2.

A notion of equivalence {\mathcal C}' \simeq {\mathcal C} – weaker than isomorphism – is introduced on categories: a functor F : {\mathcal C} \to {\mathcal C}' such that there exists a functor F' : {\mathcal C}' \to {\mathcal C} and natural isomorphisms id_{\mathcal C} \xrightarrow{\cong} F' \circ F and id_{{\mathcal C}'} \xrightarrow{\cong} F \circ F'. This seemed odd to me – the lack of insistence on isomorphism for equivalence – but the examples all make sense. More importantly, it is shown that this requirement is sufficient for the action of the functor F on morphisms to be bijective for every fixed pair of objects in {\mathcal C} (i.e. F is fully faithful), which feels right. I think my understanding could benefit from further interesting examples of usefully equivalent but non-isomorphic categories though – suggestions welcome!

Limits and Colimits

Chapter 6 is primarily given over to the definitions and discussion of limits and colimits, generalising the ideas in \mathbf{Set} to other categories. The chapter begins by looking again at product and coproduct. As an example, for a category {\mathcal C}, with objects X, Y \in \text{Ob}({\mathcal C}), a span is defined as a triple (Z, p, q) where Z is an object in {\mathcal C} and p : Z \to X and q : Z \to Y are morphisms in {\mathcal C}. A product X \times Y is then defined as a span on X and Y such that for any other span (Z, p, q) on X and Y, there exists a unique morphism t_{p,q} such that the following diagram commutes. Again, I rather like the author’s slogan for this: “none shall map to X and Y except through me!”

Writing (X \times Y, \pi_X, \pi_Y) for the product span, the requirement is that \pi_X \circ t_{p,q} = p and \pi_Y \circ t_{p,q} = q.

The chapter then further generalises these ideas to the ideas of limit and colimit. This section was a step up in abstraction and also one of the least “applied” parts of the book. It has certainly piqued my interest – what are limits and colimits in various structures I might use / encounter in my work? – but I found the examples here fairly limited and abstract compared to the copious ones in previous chapters. It’s in this chapter that universal properties begin to rear their head in a major way, and again I’m still not totally intuitively comfortable with this idea – I can crank the handle and answer the exercises, but I don’t feel like I’m yet at home with the idea of universal properties.

Categories at Work

In the final section of the book, we first study adjunctions. Given categories {\mathcal B} and {\mathcal A}, an adjunction is defined as a pair of functors L: {\mathcal B} \to {\mathcal A} and R: {\mathcal A} \to {\mathcal B} together with a natural isomorphism \alpha whose component for A \in \text{Ob}({\mathcal A}), B \in \text{Ob}({\mathcal B}) is \alpha_{B,A}: \text{Hom}_{\mathcal A}(L(B), A) \xrightarrow{\cong} \text{Hom}_{\mathcal B}(B, R(A)). Several preceding results and examples can then be re-expressed in this framework, and understood as examples of adjunctions. Adjunctions are also revisited later in the chapter to show a close link with monads.
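A standard example, which ties back to the free monoid from earlier in the book (my gloss, not the book’s presentation): the free monoid functor L(X) = \mbox{List}(X) is left adjoint to the forgetful functor from monoids to sets, because a monoid homomorphism out of \mbox{List}(X) is the same data as an ordinary function on X.

```python
# Sketch of the free/forgetful adjunction for monoids: extend a plain function
# phi : X -> M to the unique monoid homomorphism List(X) -> M.
def extend(phi, op, identity):
    def hom(word):
        result = identity
        for x in word:
            result = op(result, phi(x))
        return result
    return hom

phi = {"a": 2, "b": 5}.get                   # a plain function X -> integers
h = extend(phi, lambda m, n: m + n, 0)       # the induced homomorphism to (Z, +, 0)
assert h([]) == 0                                        # identity is preserved
assert h(["a", "b", "a"]) == h(["a", "b"]) + h(["a"])    # concatenation maps to +
```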

Categories of functors are discussed, leading up to a discussion of the Yoneda lemma. This is really interesting stuff, as it seems to cross huge boundaries of abstraction. Consider a category {\mathcal C}. Spivak first defines a functor Y_c := \text{Hom}_{\mathcal{C}}(c, -) : {\mathcal C} \to \mathbf{Set} sending an object d of {\mathcal C} to the set \text{Hom}_{\mathcal C}(c,d), and similarly for morphisms. This is then used in the following theorem:

Let {\mathcal C} be a category, c \in \text{Ob}(\mathcal{C}) be an object, and I: {\mathcal C} \to \mathbf{Set} be a set-valued functor. There is a natural bijection \text{Hom}_{\mathcal{C}-\mathbf{Set}}(Y_c, I) \xrightarrow{\cong} I(c).

This looks super weird: the right-hand side is just a plain old set and yet the left hand-side seems to be some tower of abstraction. I worked through a couple of examples I dreamt up, namely \mathcal{C} = \mathbf{1} (the category with one object) with I mapping that object to an arbitrary set, and \mathcal{C} = \mathbf{Set} with I = \text{id}_{\mathbf{Set}} and convinced myself this works just fine for those two examples. Unfortunately, the general proof is not covered in the book – readers are referred to Mac Lane – and while some discussion is provided regarding links to databases, I felt like I need a proof to properly grasp what’s going on here under the hood.

Monads are then covered. I’ve briefly seen monads in a very practical context of functional programming, specifically for dealing with I/O in Haskell, but I’ve never looked at them from a theoretical perspective before.

A monad on \mathbf{Set} (monads are only defined in the book on \mathbf{Set}, for reasons not totally apparent to me) is defined as a triple (T, \eta, \mu) consisting of a functor T: \mathbf{Set} \to \mathbf{Set}, a natural transformation \eta: \text{id}_{\mathbf{Set}} \to T, and natural transformation \mu: T \circ T \to T such that the following diagrams commute:

 

These are the usual monad laws: associativity, \mu_X \circ T(\mu_X) = \mu_X \circ \mu_{T(X)}, and the unit laws, \mu_X \circ T(\eta_X) = \text{id}_{T(X)} = \mu_X \circ \eta_{T(X)}, for every set X.

I found this discussion enlightening and relatively straight-forward. Firstly, exception monads are discussed, like Maybe in Haskell, and a nice example of wrapping sets and their functions into power sets and their functions, for which I again enjoyed the exercises. Kleisli categories of monads are defined, and it then becomes apparent why this is such a powerful formalism to think about concepts like program state and I/O.

The Kleisli category \mathbf{Kls}(\mathcal{T}) of a monad \mathcal{T} = (T, \eta, \mu) on \mathbf{Set} is defined as a category with objects \text{Ob}(\mathbf{Kls}(\mathcal{T})) = \text{Ob}(\mathbf{Set}) and morphisms \text{Hom}_{\mathbf{Kls}(\mathcal{T})}(X,Y) = \text{Hom}_{\mathbf{Set}}(X,T(Y)). The identity \text{id}_X: X \to X in \mathbf{Kls}(\mathcal{T}) is exactly \eta_X in \mathbf{Set}, and composition of morphisms f: X \to Y and g: Y \to Z in \mathbf{Kls}(\mathcal{T}) is given by \mu_Z \circ T(g) \circ f in \mathbf{Set}. You can intuitively see the ‘baggage’ of T being carried through in \mathbf{Set} yet hidden from view in \mathbf{Kls}.
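To connect this back to the Haskell intuition, here is a rough Python sketch of my own of the exception (Maybe) monad: T(X) adjoins a failure value to X, and Kleisli composition \mu_Z \circ T(g) \circ f simply threads the failure through, so possibly-failing functions compose like ordinary ones.

```python
# Hedged sketch of the exception/Maybe monad and its Kleisli composition
# (mine, not the book's; the explicit mu is elided, since the failure value
# just propagates).
NOTHING = object()

def kleisli(f, g):
    """Compose f : X -> T(Y) with g : Y -> T(Z) to obtain X -> T(Z)."""
    def composed(x):
        y = f(x)
        return NOTHING if y is NOTHING else g(y)
    return composed

safe_recip = lambda x: NOTHING if x == 0 else 1.0 / x      # X -> T(Y)
safe_sqrt = lambda x: NOTHING if x < 0 else x ** 0.5       # Y -> T(Z)

h = kleisli(safe_recip, safe_sqrt)
assert h(4) == 0.5
assert h(0) is NOTHING
```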

Finally, the book ends with a (very) brief discussion of operads. I’ve not come across operads before, and from this brief discussion I’m not really sure what they ‘buy’ me beyond monoidal categories.

General

The exercises are copious and great, especially in the earlier chapters. They kept me engaged throughout and helped me to really understand what I was reading, rather than just think I had. In the later exercises there are many special cases that turn out to be more familiar things. This is both enlightening “oh, an X is just a Y!” and frustrating “yeah, but I want to see Xs that aren’t Ys, otherwise what’s all this heavyweight general machinery for?” I guess this comes from trying to cover a lot of ground in one textbook.

I am struck by just how much mental gymnastics I feel I had to go through to see mathematical structures from the “outside in” as well as from the “inside out”, and how category theory helped me do that. This is a pleasing feeling.

The slogans are great and often amusing, e.g. when describing a functor from the category of Monoids to the category of Sets, “you can use a smartphone as a paperweight.”

The book ends rather abruptly after the brief discussion of operads, with no conclusion. I felt it would be nice to have a summary discussion, and perhaps point out the current directions the field of applied category theory is taking.

Overall, this book has been an enlightening read. It has helped me to see the commonality behind several disparate mathematical structures, and has encouraged me to look for such commonality and relationships in my own future work. Recommended!

Machine Learning at FPT 2019

Next week, the IEEE International Conference on Field-Programmable Technology (FPT) will take place in Tianjin in China. I’m proud that my former PhD student Qiang Liu will be General Chair of the conference.

I am a coauthor of two papers to be presented at FPT, one led by my former BEng student Aaron Zhao, now a PhD student at Cambridge supervised by my colleague Rob Mullins, and one led by my former postdoc, Ameer Abdelhadi, now with COHESA / UofT. The paper by Aaron is also in collaboration with two of my former PhD students, Xitong Gao, now with the Chinese Academy of Sciences, and Junyi Liu, now with Microsoft Research.

The first paper, led by Aaron, is entitled ‘Automatic Generation of Multi-precision Multi-arithmetic CNN Accelerators for FPGAs’, and can be found on arXiv here. This paper is a serious look at getting an automated CNN flow for FPGAs that makes good use of some of the arithmetic flexibility available on these devices. Powers-of-two (“free” multiplication) and fixed-point (“cheap” multiplication) are both leveraged.

The second paper, led by Ameer, looks at the computation of a set of approximate nearest neighbours. This is useful in a number of machine learning settings, both directly as a non-neural machine learning inference algorithm and indirectly within sophisticated deep learning algorithms like Neural Turing Machines. Ameer has shown that this task can be successfully accelerated in an FPGA design, and explores some interesting ways to parameterise the algorithm to make the most of the hardware, leading to tradeoffs between algorithm accuracy and performance.

If you’re at FPT this year, please go and say hello to Aaron, Ameer and Qiang.

Highlights of Asilomar 2019

This week, I attended my first Asilomar Conference on Signals, Systems, and Computers, a very long-running conference series of the IEEE Signal Processing Society, with a very broad range of topics. I decided to attend Asilomar after being invited to give not just one talk, but two, once by my friend and collaborator Miloš Ercegovac from UCLA, and once by my good colleague Zhiru Zhang from Cornell.

No discussion of highlights of Asilomar can go without pointing out the extraordinarily beautiful setting of a conference centre right on Asilomar Beach. I can certainly see why the conference organisers keep coming back year after year – since the 1970s for Miloš and even earlier for my old friend fred harris, who I met there by surprise.


Distinguished Lecture

The conference opened with a distinguished lecture by Helmut Bölcskei from ETH Zurich, who gave a wonderful talk about the fundamental limits of deep learning. The key results he presented were about neural networks built of linear computational units and ReLU functions, and he showed how they can approximate a range of different functions. I was already familiar with asymptotic results for infinite depth or infinite width networks, but Bölcskei’s results were different – they showed how the approximation quality can be traded against a metric of neural network complexity that captured the number of bits needed to store the topology and the weights of the network. He was able to show the power of such neural networks across an extremely broad class of functions, and to explain how this comes about.

Compilation for Spatial Computing Architectures

This session was organised by Zhiru Zhang from Cornell and Hongbo Rong from Intel. The first talk, given by Yi-Hsiang Lai from Cornell, described the HeteroCL infrastructure, about which I’ve previously blogged in my description of FPGA 2019. Very closely related to this was Hongbo’s own work at Intel Labs, which makes heavy use of polyhedral methods, and work from the systolic array community on affine and uniform recurrence equations.

I then gave a talk about some of the work my research group has been doing over the past 12+ years in analysis of memory access patterns for High-Level Synthesis, taking in my early foundational work in bringing the polyhedral model to HLS with Qiang Liu (now at Tianjin University), our work on Separation Logic in HLS (now also a book by Felix Winterstein, my former PhD student who leads Xelera Technologies), and our recent work on utilising Microsoft Boogie in this context for multi-threaded HLS by my current PhD student Jianyi Cheng.


Finally, Thierry Moreau from the University of Washington presented his very interesting work on a hardware-software open-source stack for modern deep learning (see the TVM website).

Computer Arithmetic

This session was organised by Miloš Ercegovac from UCLA and Earl Swartzlander from UT Austin. The first talk in this session was from Fredrik Dahlqvist, a postdoc in my group, who spoke about our work together with Rocco Salvia marrying ideas from probabilistic programming with rounding error analysis.


Miloš Ercegovac from UCLA and James Stine from Oklahoma State University looked at how digit iteration techniques for division compare to multiplication-based techniques. Alexander Groszewski and Earl Swartzlander from UT Austin discussed their results from deterministic unary arithmetic inspired by stochastic computing; Keshab Parhi from the audience raised the interesting point of the importance of preservation of temporal structure in specially designed deterministic sequences for purposes of compositionality.

I really enjoyed the unusual talk by Keshab Parhi (U. Minnesota) on Molecular Computing Inspired by Stochastic Logic (see here for more details) via Fractional Coding, building on Soloveichik, Seelig and Winfree. If digits are encoded as relative concentrations of molecules, the problem of signal correlation, which tends to take the shine off stochastic computing work, can be avoided. He proposed computation using molecular reaction rates, and showed how to encode values as concentrations of two different molecules; his techniques have been verified in simulation – I would love to see this in a test-tube.

Theory of Deep Learning

This session was organised by Richard Baraniuk and Santiago Segarra (Rice University.)

There was a very enjoyable talk by Alessandro Achille from UCLA on studying deep neural networks from an information-theoretic perspective. He pointed out that real-valued weights appear to contain infinite information, but that by using the principle that small perturbations in weights should not throw off the classification result completely, we can recover a finite weight encoding. He then moved on to show using a PAC-Bayes bound that good generalisation comes from low weight information. He demonstrated that Stochastic Gradient Descent implicitly minimises Fisher information, but that for generalisation performance, it is Shannon information that should be bounded – he then derived a connection between the two under some conditions.

Tom Goldstein (University of Maryland) gave a stunningly illustrated talk on Understanding Generalization in Neural Nets via Visualization, based on his co-authored paper on the topic. He sought to empirically understand how the continuous piecewise linear functions of modern DNNs, when combined with SGD-based optimisation, lead to functions that generalise well. This was done via a clever process of “poisoning” training data to obtain badly generalising minima.

AI/ML Architectures

This session was organised by Keshab Parhi (University of Minnesota.)

Danny Bankman gave a talk about Stanford’s RRAM-based DNNs. He showed that register-file access accounts for the majority of energy in standard CMOS processor-like architectures, and drew the conclusion that architectures should be “memory-like” in their design, using “conductance-mode arithmetic” with very low precision integer activations, and put the necessary ramp generator for their ADC right inside the RRAM array. Results are verified using SPICE. I know little about RRAM technology, but talking with my colleagues Themis Prodromakis and Tony Kenyon has got me intrigued.

Deep Learning Theory

This session was organised by Tom Goldstein (University of Maryland.)

My favourite talk in this session was by Tom himself, in which he presented an analysis of adversarial attacks in DNNs, again beautifully illustrated – based on his co-authored paper. He showed that due to the high dimensionality of the spaces involved, you are extremely likely to hit – at random – a point in the input space that can be adversarially perturbed. He demonstrated – using the audience as guinea pigs – that adversarial perturbation can also trick humans quite easily on the CIFAR-10 data set. Perhaps my favourite twist on the talk was that he gave the talk wearing an “invisibility cloak” which – if worn – tricks YOLO into not identifying the wearer.

Reflections on Asilomar

I’ve sent PhD students to Asilomar before, but this was the first time I attended myself. It’s a very broad conference, in a beautiful setting. It seems to be a great venue to complement the more technically homogeneous conferences like FPGA which I help to organise – they serve different purposes. Asilomar is a great conference to have your work seen by people who wouldn’t usually follow your work, and to pick up ideas from neighbouring fields.

Approximating Circuits

Next week, Ilaria Scarabottolo, currently a visiting research student in my research group at Imperial, will present her paper “Partition and Propagate” at DAC 2019 in Las Vegas. In this post, I will provide a brief preview of her work (joint with Giovanni Ansaloni and Laura Pozzi from Lugano and me.)

I’ve been interested in approximation, and how it can be used to save resources, ever since my PhD 20 years ago, where I coined the term “lossy synthesis” to mean the synthesis of a circuit / program where error can be judiciously introduced in order to effect an improvement in performance or silicon area. Recently, this area of research has become known as “approximate computing“, and a bewildering number of ways of approximating behaviour – at the circuit and software level – have been introduced.

Some of the existing approaches for approximate circuit synthesis are point solutions for particular IP cores (e.g. our approximate multiplier work) or involve moving beyond standard digital design methodologies (e.g. our overclocking work.) However, a few pieces of work develop a systematic method for arbitrary circuits, and Ilaria’s work falls into this category.

Essentially, she studies the class of approximations that can be induced solely by removing chunks of a logic circuit and replacing the dangling nets with constant values – a technique my co-authors referred to as Circuit Carving in their DATE 2018 paper.

Our DAC paper presents a methodology for bounding the error that can be induced by performing such an operation. Such error can be bounded by exhaustive simulation or SAT, but not for large circuits with many inputs due to scalability concerns. On the other hand, coarse bounds for the error can be derived very quickly. Ilaria’s work neatly explores the space between these two extremes, allowing analysis execution time to be traded for bound quality in a natural way.
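As a toy illustration of the idea (mine, and far simpler than anything in the paper): carve the least-significant full adder out of a 4-bit ripple-carry adder, tie its dangling nets to constant zero, and bound the induced error exhaustively; exhaustive bounding is feasible here only because the circuit has just eight inputs.

```python
# Toy 'circuit carving': remove the LSB full adder of a 4-bit adder, tie the
# dangling sum and carry nets to 0, and exhaustively bound the error.
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def adder4(x, y, carve_lsb=False):
    out, carry = [], 0
    for i in range(4):
        a, b = (x >> i) & 1, (y >> i) & 1
        if carve_lsb and i == 0:
            s, carry = 0, 0            # carved chunk replaced by constants
        else:
            s, carry = full_adder(a, b, carry)
        out.append(s)
    return sum(s << i for i, s in enumerate(out)) + (carry << 4)

max_err = max(abs(adder4(x, y) - adder4(x, y, carve_lsb=True))
              for x in range(16) for y in range(16))
print("worst-case error of the carved adder:", max_err)   # 2 (both dropped LSBs are 1)
```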

Approximation’s time has definitely come, with acceptance in the current era often driven by machine-learning applications, as I explore in a previous blog post. Ilaria’s paper is an interesting and general approach to the circuit-level problem.