# 2016 SATS: Scaled Scores

On the 3rd of June England’s Department for Education released information about how to turn children’s answers in their KS1 tests into a scaled score. I am profoundly disappointed by the inconsistency between this document and the fanfare produced over the abolition of levels.

By introducing these scaled scores, the DfE has produced a new level of achievement known in their paper as “the expected standard on the test”. Note that this is quite a different thing to “the expected standard” defined in the Interim Teacher Assessment Frameworks at the End of KS1. Confused? You should be.

When moving from a level-based “best fit” assessment to the new assessment framework (see my earlier blog post for my concerns on this), a key element of the new framework was that a pupil is only assessed as having met the expected standard if they have attained “all of the statements within that standard and all the statements in the preceding standards” (boldface in original). As schools up and down the country struggle to produce systems capable of tracking pupil progress, I’ve been waiting to see how the Department intends to square this assessment approach with testing. Now the answer is in: they don’t.

Let me explain why. To simplify matters, let’s look at a stripped down version of the “expected standard” for KS1 mathematics. Let’s imagine it just consists of the first two statements:

• The pupil can partition two-digit numbers into different combinations of tens and ones. This may include using apparatus (e.g. 23 is the same as 2 tens and 3 ones which is the same as 1 ten and 13 ones).
• The pupil can add 2 two-digit numbers within 100 (e.g. 48 + 35) and can demonstrate their method using concrete apparatus or pictorial representations.

Leaving aside the apparatus question (guidance here states that children were not allowed apparatus in the test, so quite how that’s supposed to measure the expected standard is a mystery), the question remains – how do you convert assessment of each individual strand into an assessment of whether the expected standard is met. Let’s assume our test has a question to test each statement. The teacher assessment guidance is straightforward, if flawed: assess each strand individually and only if all strands have been reached has the “expected standard” been reached. Translating this into our imaginary test, this would mean: mark each question individually, and only if all questions are above their individual pass mark, the standard has been met. Is this the approach taken? Not at all. The approach taken is exactly that used under levels: add up all the marks for all the questions, and if the total is above a threshold then the “expected standard on the test” has been met, i.e. it is a best fit judgement. Yes, that’s right, exactly the kind of judgement railed against by the Department for Education and the Commission on Assessment without Levels – we are back to levels. For better or for worse.

The total mismatch between the approach enforced in testing and the approach enforced in teacher assessment has obviously been spotted by the DfE because they themselves say:

The tests are also compensatory: pupils score marks from any part of the test and pupils with the same total score can achieve their marks in different ways. The interim teacher assessment frameworks are different.

Frankly, this is a mess.

Key Stage 2 test results are out on the 5th of July. I expect a similar approach then, except this time those results form the basis of the school accountability system.

The NAHT is quite right to call for school level data from the flawed 2016 assessments not to be used for external purposes and to question the whole approach of “secure fit”.

# KAPow: Online Instrumentation of Power

Great news from the IEEE International Symposium on Field-Programmable Custom Computing Machines this week – our paper KAPow: A System Identification Approach to Online Per-module Power Estimation in FPGA Designs won the best paper award!

KAPow is all about trying to understand where your FPGA design is burning power, at run time, so that some higher level control entity can make intelligent decisions based on that information.

The problem is that we just don’t have the power sensors – we only know the total power consumed by the device, not the power consumed by each module in the design. So what to do?

Enter KAPow. Our tool (soon to be released publicly as an output from the PRiME project) will take RTL, instrument it and return back RTL that estimates its own power consumption.

The idea is fairly straight-forward: Automatically identify at synthesis (compile) time a subset of nets that you think are going to have a good correlation with power consumption of a module. Insert cheap counters to monitor their activity at run-time, and then use an online recursive least squares model to adaptively learn a model of power consumption of each module, based on the overall power consumption of the chip.

Results are good: power estimates are good down to about 5mW, and the infrastructure itself adds less than 4% to the total power consumed by the device.

Now, of course, the question is what to do with all this information that can now be collected at run-time. Watch this space!

# Super Fast Loops

On Tuesday, my PhD student Junyi Liu presented our work (joint with John Wickerson) on Loop Splitting for Efficient Pipelining in High-Level Synthesis to the assembled audience at the IEEE International Symposium on Field-Programmable Custom Computing Machines in Washington DC.

A primary way in which FPGA applications tend to get their blindingly fast performance is through overlapping loop iterations in time – known variously as loop pipelining or software pipelining. These days, you can expect high-level synthesis tools to do this for you. Sometimes.

Unfortunately there are cases where the tools can’t get squeeze out performance. This paper addresses two such cases in a unified framework. The first is the case where some pesky loop iterations get in the way. Consider this trivial example:

```
for( int i=0; i&amp;lt;N; i++ )
A[2*i] = A[i] + 0.5f;

```

In this case the early loop iterations are the problematic ones because A[2] depends on A[1], A[4] on A[2], etc. These tight dependences hinder pipelining, leading existing HLS tools to throw in the towel.

The second case is where there are loop invariant parameters that are not known until the loop executes. Consider the case:

```
for( int i=0; i&amp;amp;lt;N; i++ )
A[i+m] = A[i] + 0.5f;

```

Without knowing the value of m at compile time, the dependence structure is unknown – we might have no read-after-write dependences, tight read-after-write dependences, or the dependences might be so many iterations away that we just don’t care and can pipeline away to our heart’s content. A limited version of this latter issue was addressed in Junyi’s earlier paper, the predecessor of this work.

In the new FCCM 2016 paper, we show that both these cases can be analysed using a parametric polyhedral framework, and show that we can automatically derive source-to-source transformations to significantly accelerate the loops in these cases. The end result? A push button approach that could gain you a factor of more than 4x in performance if your pipelining is being stymied by pesky dependences.

# FPGA 2016: Some Highlights

I’ve just returned from the ACM International Symposium on FPGAs, held – as usual – in sunny Monterey, California. This year I was Finance Chair of the conference, which meant I had less running about to do than the last two years, when I was Programme Chair in 2014 and then General Chair in 2015. I therefore took my time to enjoy the papers presented, which were generally of a high quality. Intel’s acquisition of Altera last year provided an interesting backdrop to the conference, and genuinely seems to have fired the community up, as more and more people outside the FPGA “usual suspects” are becoming very interested in the potential for this technology.

Below, I provide my impressions of the highlights of the conference, which necessarily form a very biased view based on what I find particularly interesting.

There were a couple of very interesting papers from Altera on their Stratix 10 device. These devices come with customisable clock trees, allowing you to keep the clock distribution local to your clock regions. The results showed that for regions consisting of fewer than 1.6M LEs, it was better to use a configurable clock region rather than any fixed clock region, as the latter needs greater margining. In a separate paper, Dave Lewis presented a detailed look at the Stratix 10 pipelined routing architecture, which gave a good insight into industrial architecture exploration. They had explored a range of circuits and tried to identify the maximum retiming performance that could be achieved around loops as a function of the number of pipelining registers they need to insert in the routing muxes. The circuit designs were discussed, but for me the most interesting thing about this innovation is the way it changes the tools: focus can be placed on P&R for loops rather than feed-forward portions of the design in the case that the design can tolerate latency; this is exactly where tools should be focusing effort. The kind of timing feedback to the user also changes significantly, and for the better.

Zgheib et al from Paolo Ienne’s group at EPFL presented their FPRESSO tool available at fpresso.epfl.ch which harnesses standard cell tools to explore FPGA architectures. Their tool is open source, and this paper won the best paper prize (becoming something of a habit for Paolo!). This tool could be an interesting basis for future academic architecture exploration.

Safeen Huda from Jason Anderson’s group at U of T presented an interesting suggestion for how suppress glitches for power reasons in FPGA circuits in the presence of PVT variation.

Davis demonstrated our work on run-time estimation of power on a per-module basis, part of the PRiME project, at the relevant poster session, and I was pleased by the number of people who could see the value in run-time monitors for this purpose.

High Level Tools

Gao’s talk from my group on automatic optimisation of numerical code for HLS using expression rewriting was very well received, especially since he has made his tool available online at https://admk.github.io/soap/. We would be delighted to receive feedback on this tool.

Ramanathan’s presentation from my group on the case for work-stealing on FPGAs using atomic memory operations in OpenCL also seemed to generate quite a buzz at the discussion session afterwards.

Applications

There were quite a few application papers this year targeting Convolutional Neural Networks (CNNs), the application of the moment. Both of the full papers in this area (from Tsinghua and from Arizona State) emphasised the need to use low precision fixed-point datapath, an approach I’ve been pushing in the FPGA compute space for the last 15 years or more. This application seems to be particularly suited to the problem, allowing computation with little impact on classification down as low as 8 bits. The work from Tsinghua university also took advantage of an SVD approach to reduce the amount of compute required. I think there’s some promise to combine this with the fixed-point quantization, as first pointed out by Bouganis in his FCCM 2005 paper.

Overlays

The conference was preceded by the OLAF workshop run by John Wawrzynek and Hayden So. I must admit that I am not a huge fan of the idea of a hardware-implemented FPGA overlay architecture. I can definitely see the possible advantages of an overlay architecture as a conceptual device, a kind of intermediate format for FPGA compilation. I find it harder to make a compelling case for implementing that architecture in the actual hardware. However, if overlay architectures make programming FPGAs easier in those hard-to-reach areas (until we’ve caught up with our HLS technology!) and therefore expand the user base, then bring it on! A bit like floating point, in fact!

# Stealing Work for Your FPGA

Tomorrow, my PhD student Nadesh Ramanathan gives his first conference presentation, at the ACM International Symposium on FPGAs, claiming a place for work-stealing on FPGAs (joint work with John Wickerson and Felix Winterstein). The short paper on which the presentation is based can be found here.

Nadesh argues that we should pursue lock free approaches to load balancing for FPGAs, and shows that this can be implemented within Altera’s OpenCL framework. Initial work from an efficient K-means clustering algorithm which manipulates dynamic data structures demonstrates that this approach shows promise for the future. As we move to put more and more irregular applications on FPGAs, this kind of methodology will become increasingly important.

# Fields Can Make Your Loops Run Faster

On Tuesday, my PhD student Xitong Gao will present our work (joint with John Wickerson) on Automatically Optimizing the Latency, Area, and Accuracy of C Programs for High-Level Synthesis at the ACM International Symposium on Field-Programmable Gate Arrays.

Our starting point for this work is that people often want to express their algorithms in terms of real numbers, but then want to have an implementation is some finite precision arithmetic on an FPGA after passing the results through a High-Level Synthesis tool.

One of the key limiting factors of the performance of an algorithm is the ability of the HLS tool to pipeline loops in the design in order to maximise throughput. It turns out that numerical properties of real numbers can come to the rescue!

It’s a well known (but often forgotten) fact that computer-represented numbers are not – in fact are very far from – real numbers. In particular, the main equivalence properties that the real numbers exhibit: closure, associativity, commutativity, distributivity, etc. simply do not apply for a large number of practically used data representations.

In our paper we take account of this discrepancy by automatically refactoring code to make it run faster (while tracking area and accuracy too). As a trivial example for the recurrence $x_n = 3 x_{n-1} + 1$, a multiply and an add must be performed before executing the next iteration. But transforming this to $x_n = 9 x_{n-2} + 4$ gives much more time to execute. Combining this with expression balancing, etc., leads to a wide variety of possible implementations, which our tool can explore automatically while also proving the numerical properties of the transformed code. This latter point makes it quite unlike so-called “unsafe” optimisations commonly used in compilers such as -funsafe-math-optimizations in gcc.

# HiPEAC 2016

I have just returned from HiPEAC, a major “roadshow” of primarily European research in high performance and embedded computing. As always, it was good to catch up with old friends and colleagues from around the world. I briefly describe below some of the highlights for me, as well as my contributions as part of a panel discussion held on  Tuesday.

The keynote talk at PARMA-DITAM from my old friend Pedro Diniz on compiler transformations for resilience. A theme that kept coming up throughout the conference, and is also a key topic of the large project I have with Southampton, Manchester and Newcastle Universities.

Kirsten Eder’s talk at the ICT-ENERGY workshop on the work at Bristol on modelling power consumption in the XMOS architecture. An interesting choice, it turns out, because the communication in XMOS is transparent to the programmer, which makes power modelling at instruction level much more appealing.

Another old friend, Juergen Teich, gave a great keynote at IMPACT on extending scheduling to the parametric / symbolic case, in order to allow for the number and shape of a rectangular array of processing units to be determined at run time rather than at compile time. This aligns very nicely with our own drive to parametricity, in our case for run-time pipelining decisions in high-level synthesis.

Cristiano Malossi from IBM gave an interesting keynote at WAPCO on mixed precision conjugate gradient solvers, amongst other things. My former PhD student Antonio Roldao Lopes did an early investigation into how the precision of such a solver impacts its performance (lower precision, more iterations, but each iteration runs faster). It was interesting to see this extended to a mixed precision setting, where expensive parts of the algorithm are executed in low precision while cheap parts are executed in high precision.

I was invited to participate in a panel discussion in front of the audience at the WRC workshop. The other panelists were Aaron Smith, who is writing compilers for the Microsoft FPGA system now in the Bing search engine, Ronan Keryell, who does high-level design R&D at Xilinx, Peter Hofstee, from IBM, and Koen Bertels from TU Delft. The panel was moderated by another old friend, Juergen Becker. We were asked to discuss whether “the embedded revolution is dead” or whether it’s “just the microprocessor” whose death has come (!), and what the future holds. My perspective was that the embedded revolution has not started yet (how much computational intelligence is embedded in most things around us, and how much better could our lives be if there were more?), that the microprocessor is far from dead (though future microprocessors will look far more ‘FPGA-like’ as the number of cores expand beyond the point at which it becomes unreasonable to ignore the fact we’re computing in space), and that the future belongs to high performance embedded systems based on heterogeneous architectures. I don’t think there was much disagreement from other members of the panel. Juergen did his tongue-in-cheek best to start an IBM-and-Microsoft-versus-others fight by suggesting that it was actually datacentres which are dead, though I think there is a general consensus that the balance between local and cloud-based compute needs to be carefully evaluated in future system designs and that neither will kill the other.

Questions from the audience were interesting because people seem mostly worried about how to program these damn things! Which is good for those of us working on exactly this question, of course. The panel’s view was that with today’s compiler technology, the two main options were Domain Specific Languages and OpenCL. I made the point that architecture specialisation, and FPGA design in particular, is naturally aligned with static analysis – you get automatic specialisation by automatically understanding what you’re going to implement; I therefore ventured that future languages for heterogeneous computing, whatever they are, will be designed to make static analysis a simpler task.

# Book Review: Out of the Labyrinth: Setting Mathematics Free

This book, by Kaplan and Kaplan, a husband and wife team, discusses the authors’ experience running “The Math Circle”. Given my own experience setting up and running a math circle with my wife, I was very interested in digging into this.

The introductory chapters make the authors’ perspective clear: mathematics is something for kids to enjoy and create independently, together, with guides but not with instructors. The following quote gets across their view on the difference between this approach and their perception of “school maths”:

Now math serves that purpose in many schools: your task is to try to follow rules that make sense, perhaps, to some higher beings; and in the end to accept your failure with humbled pride. As you limp off with your aching mind and bruised soul, you know that nothing in later life will ever be as difficult.

What a perverse fate for one of our kind’s greatest triumphs! Think how absurd it would be were music treated this way (for math and music are both excursions into sensuous structure): suffer through playing your scales, and when you’re an adult you’ll never have to listen to music again.

I find the authors’ perspective on mathematics education, and their anti-competitive emphasis, appealing. Later in the book, when discussing competition, Math Olympiads, etc., they note two standard arguments in favour of competition: that mathematics provides an outlet for adolescent competitive instinct and – more perniciously – that mathematics is not enjoyable, but since competition is enjoyable, competition is a way to gain a hearing for mathematics. Both perspectives are roundly rejected by the authors, and in any case are very far removed from the reality of mathematics research. I find the latter of the two perspectives arises sometimes in primary school education in England, and I find it equally distressing. There is a third argument, though, which is that some children who don’t naturally excel at competitive sports do excel in mathematics, and competitions provide a route for them to be winners. There appears to be a tension here which is not really explored in the book; my inclination would be that mathematics as competition diminishes mathematics, and that should competition for be needed for self-esteem, one can always find some competitive strategy game where mathematical thought processes can be used to good effect. However, exogenous reward structures, I am told by my teacher friends, can sometimes be a prelude to endogenous rewards in less mature pupils. This is an area of psychology that interests me, and I’d be very keen to read any papers readers could suggest on the topic.

The first part of the book offers the reader a detailed (and sometimes repetitive) view of the authors’ understanding of what it means to do mathematics and to learn mathematics, peppered with very useful and interesting anecdotes from their math circle. The authors take the reader through the process of doing mathematics: analysing a problem, breaking it down, generalising, insight, and describe the process of mathematics hidden behind theorems on a page. They are insistent that the only requirement to be a mathematician is to be human, and that by developing analytical thinking skills, anyone can tackle mathematical problems, a mathematics for The Growth Mindset if you will. In the math circles run by the authors, children create and use their own definitions and theorems – you can see some examples of this from my math circle here, and from their math circles here.

I can’t say I share the authors’ view of the lack of beauty of common mathematical notation, explored in Chapter 5. As a child, I fell in love with the square root symbol, and later with the integral, as extremely elegant forms of notation – I can even remember practising them so they looked particularly beautiful. This is clearly not a view held by the authors! However, the main point they were making: that notation invented by the children, will be owned and understood by the children, is a point well made. One anecdote made me laugh out loud: a child who invented the symbol “w” to stand for the unknown in an equation because the letter ‘w’ looks like a person shrugging, as if to say “I don’t know!”

In Chapter 6, the authors begin to explore the very different ways that mathematics has been taught in schools: ‘learning stuff’ versus ‘doing stuff’, emphasis on theorem or emphasis on proof, math circles in the Soviet Union, competitive versus collaborative, etc. In England, in my view the Government has been slowly shifting the emphasis of primary school mathematics towards ‘learning stuff,’ which cuts against the approach taken by the authors. The recent announcement by the Government on times tables is a case in point. To quote the authors, “in math, the need to memorize testifies to not understanding.”

Chapter 7 is devoted to trying to understand how mathematicians think, with the idea that everyone should be exposed to this interesting thought process. An understanding of how mathematicians think (generally accepted to be quite different to the way they write) is a very interesting topic. Unfortunately, I found the language overblown here, for example:

Instead of evoking an “unconscious,” with its inaccessible turnings, this explanation calls up a layered consciousness, the old arena of thought made into a stable locale that the newer one surrounds with a relational, dynamic context – which in its turn will contract and be netted into higher-order relations.

I think this is simply arguing for mathematical epistemology as akin to – in programming terms – summarizing functions by their pre and post conditions. I think. Though I can’t be sure what a “stable locale” or a “static” context would be, what “contraction” means, or how “higher order relations” might differ from “first order” ones in this context. Despite the writing not being to my taste, interesting questions are still raised regarding the nature of mathematical thought and how the human mind makes deductive discoveries. This is often contrasted in the text to ‘mechanical’ approaches, without ever exploring the question of either artificial intelligence or automated theorem proving, which would seem to naturally arise in this context. But maybe I am just demonstrating the computing lens through which I tend to see the world.

The authors write best when discussing the functioning of their math circle, and their passion clearly comes across.

The authors provide, in Chapter 8, a fascinating discussion of the ages at which they have seen various forms of abstract mathematical reasoning emerge: generalisation of when one can move through a 5×5 grid, one step at a time, visiting each square only once, at age 5 but not 4; proof by induction at age 9 but not age 8. (The topics, again, are a far cry from England’s primary national curriculum). I have recently become interested in the question of child development in mathematics, especially with regard to number and the emergence of place value understanding, and I’d be very interested to follow up on whether there is a difference between this between the US, where the authors work, and the UK, what kind of spread is observed in both places, and how age-of-readiness for various abstractions correlates with other aspects of a child’s life experience.

Other very valuable information includes their experience on the ideal size of a math circle: 5 minimum, 10 maximum, as they expect children to end up taking on various roles “doubter, conjecturer, exemplifier, prover, and critic.” If I run a math circle again, I would like to try exploring a single topic in greater depth (the authors use ten one hour sessions) rather than a topic per session as I did last time, in order to let the children explore the mathematics at their own rate.

The final chapter of the book summarises some ideas for math circle style courses, broken down by age range. Those the authors believe can appeal to any age include Cantorian set theory and knots, while those they put off until 9-13 years old include complex numbers, solution of polynomials by radicals, and convexity – heady but exciting stuff for a nine year old!

I found this book to be a frustrating read. And yet it still provided me with inspiration and a desire to restart the math circle I was running last academic year. Whatever my beef with the way the authors present their ideas, their core love – allowing children to explore and create mathematics by themselves, in their own space and time – resonates with me deeply. It turns out that the authors run a Summer School for teachers to learn their approach, practising on kids of all ages. I think this must be some of the best maths CPD going.

# Sizing Custom Parallel Caches

On Wednesday, my PhD student Felix Winterstein will present his paper Custom-Sized Caches in Application-Specific Memory Hierarchies at FPT 2015 in New Zealand. This is joint work with MIT and Intel, specifically the team who developed LEAP.

Imagine that your FPGA-based algorithm needs to access data. Lots of data. What do you do? Store that data in SDRAM and spend ages carefully thinking about how to make efficient use of scarce on-chip memory to get the best performance out of your design. What if that whole process could be automated?

Felix has been working for a while on a method to do just that. He’s been able to produce some incredibly exciting work, making use of the latest developments in software verification to devise an automatic flow that analyses heap manipulating software and how it should be efficiently implemented in hardware. At FPGA in February this year, we were able to show how our analysis could automatically design a parallel caching scheme, and know – prove, even! – when costly coherence protocols could be avoided. The remaining part of the puzzle was to automatically know – in a custom parallel caching scheme – how much on-chip memory you should allocate to which parallel cache. This is the last piece of the puzzle we will present on Wednesday.

# Book Review: Surely You’re Joking, Mr Feynman

This Summer I had a long flight to Singapore. A few hours into long flights I tend to lose the will to read anything challenging, so I decided to return to a book I first read as a young teen when it was lent to me by family friend fred harris, the book Surely You’re Joking, Mr Feynman. I remember at the time that this book was an eye opener. Here was a Nobel prize winning physicist with a real lust for life coupled with an inherent enthusiasm for applying the scientific method to all aspects of life and a healthy lack of respect for rules and authority. As a teenager equally keen to apply the scientific method to anything and everything, and keen to work out my own ideas on authority and rules, yet somewhat lacking the personal confidence of Feynman, I found the book hugely enjoyable.

I still find the book hugely enjoyable. I was chuckling throughout – the practical jokes, the encounters with non-scientific “experts” and authority figures. However, I was also somewhat taken aback by Feynman’s attitude to women, which I found misogynistic in places, something that had totally passed me by as a teenager. Feynman’s attitudes were probably unremarkable for his time (Feynman was born in 1918) so I’m not attributing blame here, but I find it odd that I didn’t notice this in the 1980s. Googling for it now, I find commentary about this point is all over the Internet. It just goes to show how many subtleties pass children and teens by, something parents and educators would do well to remember.