The fact that avr-gcc exists at all is quite amazing though. It's allowed many people to get into embedded design without needing to purchase licenses for other proprietary compilers. It might not be the most optimal compiler, but sub-optimal does not mean incorrect. If you know enough to quibble with the code, and you're seeing performance issues like this, you can just inline your assembly into C.
AVR has historically been one of the best routes to get into embedded work because there is solid open-source tooling, interfacing the hardware is simple, and it's very well documented in the open by Atmel/Microchip. Arduino would not exist without avr-gcc.
I've hoped for a while that LLVM's support for such targets would improve, but maybe the ubiquity of ARM microcontrollers makes that less likely now. There is an AVR target for example, but in my attempt to use Rust on an AVR I ran into several incorrectness bugs - and newer versions of LLVM are plain broken [0]. There is also an MSP430 backend but I've never had an excuse to play around with that...
Maybe the libgccjit backend for Rust will solve the specific scenario I was interested in, but it doesn't improve anything for LLVM itself.
Took a while but this issue was fixed this year. Until then, you could compile a patched version of LLVM, or use an older nightly.
And the code generation was quite broken indeed, most importantly the saving of registers within interrupt handlers. The bug was hard to understand, but I had documented a simple inline asm workaround on the issue. It has been fixed since then.
My last experience was quite good, and I didn't notice any wrong codegen.
Oh hey, you're right, I went and found the issue but didn't scroll down after seeing it was still open. That's great! I might try and revive my previous project. I had tried an older nightly but that's where I think I was seeing bad codegen.
Isn't Rust designed around things like immutable variables, where you need to copy and copy and copy the same things over and over, and therefore expect to run it on something with infinite amounts of memory? It might be suitable for modern desktop computers which have hundreds or even thousands of kilobytes of memory available, but not a microcontroller with 2kB of RAM.
The projects I work on are in the 32 bit ARM space, so we have more space than the projects you're talking about, but we use a microkernel, compile all programs separately, and then put them all together at the end. Some example programs and sizes:
(Note that some of these are larger than is actually required, I got these numbers by checking in on the amount of memory space they request from the OS, which has to be a power of 2 in the current implementation, so something that's 4097 bytes ends up being 8k in these numbers.)
I use AVR Rust and have dumped the resulting assembly output and I was suitably impressed. It optimized the living daylights out of the code and the resulting assembly was amazingly short and did exactly what you'd expect. It didn't really waste any stack space or RAM. It's truly amazing to see modern rust running on an 8 bit CPU in such a reasonable way (of course it also helps to write in an 8 bit friendly way, like using u8 or i8 for things that don't need to be big), but it's also nice to know I can just use a 64 bit int and it'll do whatever it has to do to work.
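To show what "8-bit friendly" means in practice, here's a hedged sketch (the function names and data are mine, not from my project): small counters fit in a single 8-bit register on AVR, while a 64-bit accumulator still works, it just lowers to multi-byte operations.

```rust
// Hypothetical sketch: 8-bit-friendly counters vs. a wide accumulator.
// On AVR, the u8 arithmetic stays in single 8-bit registers; the u64
// sum is lowered by the compiler to multi-byte operations.
fn count_high_bits(samples: &[u8], threshold: u8) -> u8 {
    let mut count: u8 = 0; // fits the target's native width
    for &s in samples {
        if s > threshold {
            count = count.wrapping_add(1); // explicit wrap-around, no panic path
        }
    }
    count
}

fn wide_sum(samples: &[u8]) -> u64 {
    // A 64-bit accumulator still works; it just costs more instructions.
    samples.iter().map(|&s| s as u64).sum()
}

fn main() {
    let data = [10u8, 200, 50, 255, 3];
    assert_eq!(count_high_bits(&data, 100), 2);
    assert_eq!(wide_sum(&data), 518);
}
```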
You're thinking Haskell. Rust has mutability all over the place, but it has safeguards against accidentally sharing across threads without synchronization or iterator invalidation.
Haskell also doesn’t need infinite memory (like parent claimed about Rust). It has immutability but it also has a garbage collector. It tends to use less memory than Java for example (to the extent that the benchmarks game is a useful benchmark):
For tiny programs that don't need much memory, Java's default allocation is obviously larger. For programs that actually need to allocate a lot of memory, the difference is much smaller.
Immutable variables aren't ever a problem. Compilers frequently translate programs into SSA form, in which every value is assigned exactly once, and they manage to make that work for all platforms without memory bloat or anything.
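To make the SSA point concrete, a small Rust sketch (names are illustrative): each shadow-free immutable binding is effectively a new SSA-style name, and none of it implies any runtime copying, after optimization these are just register values.

```rust
// Immutable bindings in SSA style: each `let` introduces a new name,
// much like the versioned values a compiler creates internally.
// No heap allocation or copying is implied.
fn scale_and_offset(x: i32) -> i32 {
    let x1 = x * 3;   // x1 = x * 3
    let x2 = x1 + 7;  // x2 = x1 + 7
    let x3 = x2 & !1; // x3 = x2 with the low bit cleared
    x3
}

fn main() {
    assert_eq!(scale_and_offset(5), 22); // (5*3 + 7) & !1
    assert_eq!(scale_and_offset(0), 6);  // (0*3 + 7) & !1
}
```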
Once you've got an optimiser, none of that really matters. Clang actually goes the other way: it emits IR with stack-allocated variables (alloca), but LLVM's mem2reg pass then attempts to promote those into static single assignment (SSA) form instead.
It's not fair to say that Rust is 'designed around immutable variables'. Rust has move semantics, but move semantics just mean that if you have a variable 'a' and you assign it to variable 'b', the variable 'a' is now dead and can no longer be referenced. This exists at the lexical level; it's a no-op in generated code.
Is that maybe what you were referring to?
For certain scalar types, flexibility is increased by allowing 'copy' semantics where assigning 'a' into 'b' makes 'b' a copy of 'a', and both are alive. Then it ends up mattering how heavy the type is - although you can only implement 'copy' for things you can trivially memcpy, so nothing on the heap.
Generally anything that would be expensive to duplicate doesn't get 'copy' semantics, but instead requires you to move it into a new variable, or explicitly clone it.
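Here's a minimal sketch of the three cases above (the values are illustrative): a heap-owning type is moved, a trivially memcpy-able scalar is copied, and expensive duplication is an explicit clone.

```rust
// Sketch of move vs. Copy vs. explicit clone semantics.
fn main() {
    // String owns heap memory, so assignment *moves* it:
    let a = String::from("hello");
    let b = a;
    // println!("{}", a); // compile error: `a` was moved into `b`
    assert_eq!(b, "hello");

    // u32 is Copy (trivially memcpy-able), so both stay alive:
    let x: u32 = 42;
    let y = x;
    assert_eq!(x + y, 84);

    // For expensive types you opt in to duplication explicitly:
    let c = b.clone(); // deep copy of the heap buffer
    assert_eq!(b, c);
}
```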
Rust also has immutable-by-default semantics, but that's just the default; you opt into mutation with `mut`. You can mutate the contents of structs, but there can only be one mutable reference XOR an arbitrary quantity of read-only references, so mutable aliasing is not permitted. This forms the basis for many of the safety guarantees.
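A tiny sketch of that rule in action (illustrative names), showing many shared borrows, then a single mutable one once they've ended:

```rust
// One &mut XOR any number of &: the compiler enforces this statically.
fn main() {
    let mut v = vec![1, 2, 3];

    // Any number of shared (read-only) references may coexist:
    let r1 = &v;
    let r2 = &v;
    assert_eq!(r1.len() + r2.len(), 6);

    // Once the shared borrows end, a single mutable borrow is allowed:
    let m = &mut v;
    m.push(4);
    // let r3 = &v; // compile error here: `v` is still mutably borrowed

    assert_eq!(v, [1, 2, 3, 4]);
}
```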
Did that help? I was guessing at what you meant, so if that wasn't it I can always try again.
[edit] Within the context of microcontrollers, Rust requires you to be very explicit about what is and isn't permitted and how things should work. You can disallow pretty much anything expensive or non-trivial in how you construct your program.
So really anything on an AVR8 that isn't either an 8- or 16-bit int, unsigned or signed, is going to be a complete and utter monster to deal with.
Natively the CPU deals with 8-bit values. Obviously that's a little cramped so you can just about get away with using a few more instructions to do 16-bit. If you absolutely must, 32-bit ints aren't horrible to cope with, but then you start to get into a lot of unnecessary code when you want to change size.
Even a very high level language like C is a bad idea on something so constrained, because C assumes that everything is a massive approximately VAX-like architecture with mappable memory all over the place, and limitless amounts of it, possibly as much as one or two megabytes.
> Even a very high level language like C is a bad idea on something so constrained, because C assumes that everything is a massive approximately VAX-like architecture with mappable memory all over the place, and limitless amounts of it, possibly as much as one or two megabytes.
That's not really true for the AVR family; the instruction set and general architecture was designed with C in mind. Unlike say a PIC microcontroller, the AVR family has a hardware stack pointer (SPH/SPL) and a large number of 8-bit registers which can also be referenced in 16-bit pairs for the (albeit limited) set of instructions which support it.
C makes some assumptions (for instance assuming the existence of a stack pointer), but the AVR designers kept that stuff in mind. Pretty sure avr-gcc actually uses a data model with 16-bit int and pointers (and 32-bit long), not an 8-bit model as you may be expecting.
C doesn't make assumptions about the size of an address space, although you can when specifying the data model for your architecture.
The only thing that's slightly less than clean about programming AVRs in C is that they're a Harvard architecture instead of von Neumann, so you have to access program memory via special instructions (lpm/elpm/spm). That's wrapped with a __attribute__((progmem)) specifier in AVR GCC so the compiler knows it lives in a different address space.
> So really anything on an AVR8 that isn't either an 8- or 16-bit int, unsigned or signed, is going to be a complete and utter monster to deal with.
I don't see why this is a problem. Both C and Rust give you 8 bit and 16 bit types to work with. It's true that you may sometimes need assembly to eke out the last drops of performance on such small chips, but equally sometimes you don't, and C/C++/Rust are excellent tools for the job.
You can use wider types (maybe even floats, I can't recall); they'll get lowered to the target architecture, and the generated code will be in terms of narrower registers.
A 32-bit add will get turned into one 8-bit add and three 8-bit add-with-carry instructions. You won't even notice unless, to your point, you see a performance issue or start running out of code space.
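That lowering is easy to emulate in software. A hedged sketch of what the add/adc/adc/adc sequence computes, byte by byte (little-endian, as on AVR; the function name is mine):

```rust
// Emulate a 32-bit add as one 8-bit add plus three 8-bit
// adds-with-carry, mirroring AVR's add/adc/adc/adc sequence.
fn add32_bytewise(a: u32, b: u32) -> u32 {
    let (ab, bb) = (a.to_le_bytes(), b.to_le_bytes());
    let mut out = [0u8; 4];
    let mut carry = 0u8;
    for i in 0..4 {
        // 8-bit add with incoming carry, done in 16 bits to capture the carry-out
        let sum = ab[i] as u16 + bb[i] as u16 + carry as u16;
        out[i] = sum as u8;       // low byte is the result byte
        carry = (sum >> 8) as u8; // high bit becomes the next carry-in
    }
    u32::from_le_bytes(out) // carry out of the top byte is dropped, so it wraps
}

fn main() {
    assert_eq!(add32_bytewise(0x00FF_FFFF, 1), 0x0100_0000); // carries ripple up
    assert_eq!(add32_bytewise(u32::MAX, 1), 0);              // wraps, like the hardware
}
```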
With all due respect, basically everything. What you describe sounds like Haskell laziness, maybe that's what you were thinking of? Or the memory model of e.g. Ocaml and Lisp, but those are then optimized at compile-time.
Rust enforces many guarantees w.r.t. memory access & sharing at compile time, but at codegen time, it's basically as vanilla and "boring" as C++.
Pretty much all of it. Rust has immutable variables but also has mutable ones, and is not "designed around immutable variables". Additionally, immutable variables don't generally need to be "copied all over the place". Rust is not particularly memory-inefficient compared with C++.
I kept a short list of the times I was disappointed by avr-gcc on my last project (11.1.0 mind you, not the ancient 5.4.0 one maintained by Microchip):
- Expected: memcpy uses `ld X+` followed by `st Z+`. The same is expected here (20 cycles)
- Actual: the compiler doesn't know that the X register supports post-increment, creating `adiw`/`sbiw` chains instead (29 cycles, +18B). This is especially bad when the compiler is register-starved while writing structs; it makes the program a lot larger than it should be. It's very easy to get into a register-starved situation: read from a pointer (Z register) to write a struct referred to by another pointer (X register), with both pointers originally stored on the stack (Y register). Oops, the struct uses the X register, and every access generates 2x the code it should.
It's a shame the author didn't include the C snippet so that we might verify this.
Not that I'd be surprised if avr-gcc didn't produce spectacular code, but some of this is very surprising. How could the compiler function if it doesn't know which registers hold which values? Unless it was emitting very simple chunks.
In particular there seem to be a couple of MMIO stores; maybe those are done with some accessor/macro which accidentally makes too much stuff volatile, causing this extra zero-addition etc. So yes, seeing all the code could give some clues...
I was also surprised by how suboptimal avr-gcc's generated code is. On the other hand, the AVR experience is reminiscent of early DOS, terrible C compilers and all, so it remains nostalgic!
How does this even happen? Shouldn't basically all the code that made these terrible decisions be shared between all backends? This should be done before it's lowered to any specific assembly language.
> Shouldn't basically all the code that made these terrible decisions be shared between all backends?
What makes you think this? The "middle-end" (in GCC-lingo) is where architecture-specific optimisations happen and that needs to be modified on a per-architecture basis. As an obvious example, not all CPUs have vector instructions, yet auto-vectorisation is part of the optimisation passes for architectures that support it.
There's several ways of doing this, but the assembly-emitter (i.e. the "backend") is not the part concerned with that at all.
Because the issues aren't codegen, they're weird, broken optimisations. Loading values several times, subtracting zero.
Why would gcc have several register allocators instead of one slightly more configurable one? Why aren't useless actions (subtract by zero) removed in generic optimisation?
Of course there are architecture specific optimisations, but the examples provided here are not of that kind
gcc is a complex beast and optimisation is hard. There's no such thing as "generic optimisations" that remove useless actions, because depending on the hardware, certain seemingly useless actions are actually required to make correctness guarantees. Without access to the offending source code, one can only speculate what actually led to this result.
Another problem is simply that there's probably only a handful of people, likely employed by a single company or two, who maintain the parts for this specific architecture. It's very likely that there are too few users who actually report these kinds of issues and too few resources available to test and address them.
It's not an issue with gcc or its optimiser in general.
Ah. I thought GCC was more similar to LLVM in that aspect, which definitely has a generic optimiser, though different users of the library enable different ones
> Most people believe that modern compilers generate better-optimized assembly code than humans, but look at this example from AVR-GCC 5.4.0 with -O2 optimization level
Modern compilers? GCC 5.1.0 was released in 2015, and 5.4.0 is a bugfix release. Why would someone blog about its poor code generation without bothering to try the latest release?
Besides, the AVR is an exceptional architecture: an 8-bit RISC with sizeof(int) == sizeof(short) == 2. Neither GCC nor C was designed for 8-bit processors.
I don't know why it's an older version, you'll have to ask the Debian people. Maybe there's a reason, but most likely it's just that gcc tends to be a difficult piece of software to package and not enough people care/have time to keep the avr version up to date.
I actually have no idea who maintains AVR-GCC, but this seems like an excellent bug report that could improve the performance of a very important tool chain!
Amazing how in the world of software, 6 years is considered "ancient".
I know the field is a fast moving one, but my goodness. The target architecture is used in the world of embedded hardware where product lifetimes often can be measured in decades (think locomotives, industrial installations, ships, etc.)
For compliance and certification reasons alone you're often stuck with using older software versions. 6 years isn't actually that bad in that context. Keep in mind that many devices, such as ATMs, still use OS/2(!) or Windows 2000(!) as their operating systems, for example.
Well, and it's not like WG14 has been doing anything constructive for the last 20 years, so it's not like there are any new features to take advantage of.
Far as I can tell the only thing I get out of newer versions of gcc is more pointless warnings to silence.
I found the default jump table implementation to be quite costly when using something with a very small flash size such as the attiny10. Luckily gcc now has goto label support for inline assembly.
Well, still much better than sdcc, which often fails to optimize array accesses on certain targets.
And just recently I took avr-gcc down the -O3 -flto rabbit hole, and everything survived, and with the huge size win, I could add plenty of new functionality and checks.