AVR-GCC Compiler Makes Questionable Code (bigmessowires.com)
73 points by zdw on Dec 17, 2022 | 55 comments


The fact that avr-gcc exists at all is quite amazing though. It's allowed many people to get into embedded design without needing to purchase licenses for other proprietary compilers. It might not be the most optimal compiler, but sub-optimal does not mean incorrect. If you know enough to quibble with the code, and you're seeing performance issues like this, you can just inline your assembly into C.
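For example, a minimal hedged sketch (the function and its use here are illustrative, not from the article): hand-picking the single-cycle SWAP instruction through GCC's inline assembly when the compiler's own output disappoints.

    #include <stdint.h>

    /* Illustrative only: exchange the high and low nibbles of x using
       the single-cycle SWAP instruction via a read-write "+r" operand. */
    static inline uint8_t swap_nibbles(uint8_t x)
    {
        __asm__("swap %0" : "+r"(x));
        return x;
    }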

AVR has historically been one of the best routes to get into embedded work because there is solid open-source tooling, interfacing the hardware is simple, and it's very well documented in the open by Atmel/Microchip. Arduino would not exist without avr-gcc.


I've hoped for a while that LLVM's support for such targets would improve, but maybe the ubiquity of ARM microcontrollers makes that more unlikely now. There is an AVR target for example, but in my attempt to use Rust on an AVR I ran into several incorrectness bugs - and newer versions of LLVM are plain broken [0]. There is also an MSP430 backend but I've never had an excuse to play around with that...

Maybe the libgccjit backend for Rust will solve the specific scenario I was interested in, but it doesn't improve anything for LLVM itself.

[0] - https://github.com/rust-lang/rust/issues/88252


Took a while but this issue was fixed this year. Until then, you could compile a patched version of LLVM, or use an older nightly.

And the code generation was quite broken indeed, most importantly the saving of registers within interrupt handlers. The bug was hard to understand, but I had documented a simple inline asm workaround on the issue. It has since been fixed.

My last experience was quite good. And I didn't notice wrong codegen.

As an aside, the inline asm combines pretty well with generics to produce custom machine code during compilation. For example I was able to reproduce the gcc-avr built-in for delays: https://github.com/avr-rust/delay/blob/cycacc/src/delay_cycl...


Oh hey, you're right, I went and found the issue but didn't scroll down after seeing it was still open. That's great! I might try and revive my previous project. I had tried an older nightly but that's where I think I was seeing bad codegen.


Off topic, but I can confirm Rust's MSP430 backend works great.


Isn't Rust designed around things like immutable variables, where you need to copy and copy and copy the same things over and over, and therefore expected to run on something with infinite amounts of memory? It might be suitable for modern desktop computers, which have hundreds or even thousands of kilobytes of memory available, but not for a microcontroller with 2kB of RAM.


You can fit embedded Rust programs in 2kB of RAM.

The smallest binary rustc has ever produced on x86_64 is 137 bytes: https://github.com/dgotrik/tiny-rust-executable

The projects I work on are in the 32 bit ARM space, so we have more space than the projects you're talking about, but we use a microkernel, compile all programs separately, and then put them all together at the end. Some example programs and sizes:

* kernel: 32k flash, 8k RAM

* supervisor task: 8k flash, 2k RAM

* SPI tasks: 16k flash, 2k RAM

* I2C task: 16k flash, 2k RAM

* sensors task: 8k flash, 8k RAM

* an "idle" do-nothing task: 128 bytes flash, 256 bytes RAM

(Note that some of these are larger than is actually required, I got these numbers by checking in on the amount of memory space they request from the OS, which has to be a power of 2 in the current implementation, so something that's 4097 bytes ends up being 8k in these numbers.)


I use AVR Rust and have dumped the resulting assembly output, and I was suitably impressed. It optimized the living daylights out of the code, and the resulting assembly was amazingly short and did exactly what you'd expect. It didn't really waste any stack space or RAM. It's truly amazing to see modern Rust running on an 8-bit CPU in such a reasonable way (of course it also helps to write in an 8-bit-friendly way, like using u8 or i8 for things that don't need to be big), but it's also nice to know I can just use a 64-bit int and it'll do whatever it has to do to work.


You're thinking of Haskell. Rust has mutability all over the place, but it has safeguards against accidentally sharing across threads without synchronization, and against iterator invalidation.


Haskell also doesn’t need infinite memory (like parent claimed about Rust). It has immutability but it also has a garbage collector. It tends to use less memory than Java for example (to the extent that the benchmarks game is a useful benchmark):

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


For tiny programs that don't need much memory, Java's default allocation is obviously larger. For programs that actually need the memory, less so.


Languages with immutable types have tons of tricks for optimizing their execution; the immutability is only exposed at the semantic level.


You need to learn about something before you talk publicly about it.

I use Rust on 8-bit microcontrollers with 128 bytes of RAM and enough flash to store about 1000 instructions.

Enough to decode standard 433 MHz radio remotes, learn them, store and recall them from EEPROM, and drive a dozen LEDs and a button for the UX.


Immutable variables aren't ever a problem. Compilers frequently translate programs into SSA form, and they manage to work on all platforms without memory bloat or anything.


Maybe I'm dumb, but why would you need to copy an immutable value?


Because the only way to have a modified version of that value is to make a brand new copy with the modification.


Once you've got an optimiser, none of that really matters. Clang actually goes the other way: it emits IR with stack-allocated variables (alloca), but LLVM's mem2reg pass then attempts to convert those into static single assignment (SSA) form instead.
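To make that concrete, a hedged C sketch (function names invented): after mem2reg/SSA, the "one name per value" style and in-place mutation compile to the same register-only code.

    /* Both functions compute (a + b) * 2; the optimiser produces
       identical code for them, so the immutable style costs nothing. */
    int scale_mut(int a, int b)
    {
        int x = a + b;
        x = x * 2;            /* mutate in place */
        return x;
    }

    int scale_immut(int a, int b)
    {
        const int sum = a + b;        /* fresh name per value */
        const int doubled = sum * 2;  /* no copy, no extra memory */
        return doubled;
    }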


This is incorrect


Which part is incorrect?


It's not fair to say that Rust is 'designed around immutable variables'. Rust has move semantics, but move semantics just mean that if you have a variable 'a' and you assign it to variable 'b', the variable 'a' is now dead and can no longer be referenced. This exists at the lexical level; it's a no-op in generated code.

Is that maybe what you were referring to?

For certain scalar types, flexibility is increased by allowing 'copy' semantics, where assigning 'a' into 'b' makes 'b' a copy of 'a' and both stay alive. Then it ends up mattering how heavy the type is, although you can only implement 'copy' for things you can trivially memcpy, so nothing on the heap.

Generally anything that would be expensive to duplicate doesn't get 'copy' semantics, but instead requires you to move it into a new variable, or explicitly clone it.

Rust also has immutable-by-default semantics, but that's only a default. You can mutate the contents of structs, but there can be either one mutable reference or an arbitrary number of read-only references, never both; mutable aliasing is not permitted. This forms the basis for many of the safety guarantees.

Did that help? I was guessing at what you meant, so if that wasn't it I can always try again.

[edit] within the context of microcontrollers Rust requires you to be very explicit about what is and isn't permitted, and how things should work. You can disallow in your construction pretty much anything expensive or non-trivial.


(Moves are a memcpy semantically but the optimizer can remove them much of the time.)


So really, anything on an AVR8 that isn't either an 8- or 16-bit int, unsigned or signed, is going to be a complete and utter monster to deal with.

Natively the CPU deals with 8-bit values. Obviously that's a little cramped so you can just about get away with using a few more instructions to do 16-bit. If you absolutely must, 32-bit ints aren't horrible to cope with, but then you start to get into a lot of unnecessary code when you want to change size.

Even a very high level language like C is a bad idea on something so constrained, because C assumes that everything is a massive approximately VAX-like architecture with mappable memory all over the place, and limitless amounts of it, possibly as much as one or two megabytes.


> Even a very high level language like C is a bad idea on something so constrained, because C assumes that everything is a massive approximately VAX-like architecture with mappable memory all over the place, and limitless amounts of it, possibly as much as one or two megabytes.

That's not really true for the AVR family; the instruction set and general architecture was designed with C in mind. Unlike say a PIC microcontroller, the AVR family has a hardware stack pointer (SPH/SPL) and a large number of 8-bit registers which can also be referenced in 16-bit pairs for the (albeit limited) set of instructions which support it.

C makes some assumptions (for instance the existence of a stack pointer), but the AVR designers kept that stuff in mind. Pretty sure avr-gcc actually uses a 16-bit int (with a 32-bit long), not an 8-bit model as you may be expecting.

C doesn't make assumptions about the size of an address space, although you can when specifying the data model for your architecture.

The only thing that's slightly less than clean about programming AVRs in C is that they're a Harvard architecture instead of von Neumann, so you have to access program memory via special instructions (lpm/elpm/spm). That's wrapped with a __attribute__((progmem)) specifier in AVR GCC so the compiler knows it uses a different address space.
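A minimal sketch of what that looks like in practice (table name and contents invented), using avr-libc's PROGMEM and pgm_read_byte:

    #include <avr/pgmspace.h>
    #include <stdint.h>

    /* Keep the lookup table in flash instead of copying it to RAM;
       pgm_read_byte expands to an LPM access behind the scenes. */
    static const uint8_t gamma_table[4] PROGMEM = {0, 1, 4, 9};

    uint8_t gamma_lookup(uint8_t i)
    {
        return pgm_read_byte(&gamma_table[i & 3]);
    }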


> So really anything on an AVR8 that isn't either an 8- or 16-bit int, unsigned or signed, is going to be complete and utter monster to deal with.

I don’t see why this a problem. Both C and Rust give you 8 bit and 16 bit types to work with. It’s true that you may sometimes need assembly to eke out the last drops of performance on such small chips, but equally sometimes you don’t and C/C++/Rust are excellent tools for the job.


You can use wider types, maybe even floats (I can't recall); they'll get lowered to the target architecture, and the generated code will be in terms of narrower registers.

A 32-bit add will get turned into one 8-bit add and three 8-bit add-with-carry instructions. You won't even notice unless, to your point, you see a performance issue or start running out of code space.
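Concretely, a trivial hedged sketch of the kind of C that shows the lowering:

    #include <stdint.h>

    /* avr-gcc lowers this single 32-bit add to roughly one ADD plus
       three ADC (add-with-carry) instructions over the four bytes. */
    uint32_t add32(uint32_t a, uint32_t b)
    {
        return a + b;
    }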


AVR GCC totally lets you use floats, I'm not sure about Rust. It does as you'd expect result in a lot of code, but it seems correct.


It does, but they are unbearably slow.


> and limitless amounts of it, possibly as much as one or two megabytes.

where/how does C make this assumption?


With all due respect, basically everything. What you describe sounds like Haskell laziness; maybe that's what you were thinking of? Or the memory model of e.g. OCaml and Lisp, but those are then optimized at compile time.

Rust enforces many guarantees w.r.t. memory access & sharing at compile time, but at codegen time it's basically as vanilla and "boring" as C++.


Pretty much all of it. Rust has immutable variables but also mutable ones, and is not "designed around immutable variables". Additionally, immutable variables don't generally need to be "copied all over the place". Rust is not particularly memory-inefficient compared with C++.


I kept a short list of the times I was disappointed by avr-gcc on my last project (11.1.0 mind you, not the ancient 5.4.0 one maintained by Microchip):

Get highest bits with a right shift: https://godbolt.org/z/nqseMvMh5

- Expected: shift only highest byte of word (3 cycles).

- Actual: all bytes of word are shifted several times (184 cycles, +18B).

Copy low nibble to high nibble in bit field struct: https://godbolt.org/z/x8nrTWPja

- Expected: swap & or byte (4 cycles).

- Actual: mul used to shift left (7 cycles, +6B).

Extract bits 6:11 from a byte array encoding a 16-bit field: https://godbolt.org/z/EPTnbjGeP

- Expected: build result into high byte by shifting bits 6:7 from low byte into high byte using rotate (11 cycles).

- Actual: mul used to shift (16 cycles, +8B), or build result into low byte (13 cycles, +4B).

Unrolled memcpy: https://godbolt.org/z/Yvbr7vT5a

- Expected: memcpy uses `ld X+` followed by `st Z+`. The same is expected here (20 cycles).

- Actual: the compiler doesn't know that the X register supports post-increment, creating `adiw`/`sbiw` chains instead (29 cycles, +18B). This is especially bad when the compiler is register-starved while writing structs; it makes the program a lot larger than it should be. It's very easy to get into a register-starved situation: read from a pointer (Z register) to write a struct referred to by a pointer (X register), with both pointers originally stored on the stack (Y register). Oops, the struct uses the X register, and every access generates 2x the code it should.
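The exact sources are behind the Godbolt links; as a rough illustration only, hedged reconstructions of the shape of the first two cases:

    #include <stdint.h>

    /* Highest bits via right shift: ideally a MOV of the high byte
       plus two LSRs, not a shift loop over both bytes. */
    uint8_t top_bits(uint16_t w)
    {
        return (uint8_t)(w >> 10);
    }

    /* Low nibble copied to the high nibble (simplified here, without
       the bit field struct): ideally SWAP then OR, not a multiply. */
    uint8_t spread_nibble(uint8_t b)
    {
        uint8_t lo = b & 0x0F;
        return (uint8_t)((lo << 4) | lo);
    }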


It's a shame the author didn't include the C snippet so that we might verify this.

Not that I'd be so surprised if avr-gcc doesn't produce spectacular code, but some of this is very surprising. How could the compiler function if it doesn't know which registers have which values? Unless it was emitting very simple chunks.


In particular, there seem to be a couple of MMIO stores; maybe those are done with some accessor/macro which accidentally makes too much stuff volatile, causing this extra zero-addition etc. So yes, seeing all the code could give some clues...


He included some code in the comments: https://bigmessowires.com/static/avr-gcc-main.cpp


I was also surprised by how suboptimal avr-gcc's generated code is. On the other hand, the AVR experience is reminiscent of early DOS, terrible C compilers and all, so it remains nostalgic!

Two examples: https://gcc.godbolt.org/z/M4oasdK4a

* Population count: I haven't found a way to convince AVR-GCC to use LSR's shift into carry with ADC.

* Strength reduction: AVR doesn't have a barrel shifter, so it's a huge win to incrementally shift during the loop. AVR-GCC doesn't see this.
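Hedged sketches of the two cases (not the exact sources behind the Godbolt link above):

    #include <stdint.h>

    /* Population count: one would hope for eight LSR (low bit into
       carry) + ADC (accumulate the carry) pairs. */
    uint8_t popcount8(uint8_t x)
    {
        uint8_t n = 0;
        for (uint8_t i = 0; i < 8; i++) {
            n += x & 1;
            x >>= 1;
        }
        return n;
    }

    /* Strength reduction: without a barrel shifter, the (1 << i)
       below should become a mask shifted once per iteration rather
       than recomputed from scratch each time. */
    uint8_t count_set(uint8_t x)
    {
        uint8_t n = 0;
        for (uint8_t i = 0; i < 8; i++)
            if (x & (uint8_t)(1 << i))
                n++;
        return n;
    }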


How does this even happen? Shouldn't basically all the code that made these terrible decisions be shared between all backends? This should happen before anything is lowered to a specific assembly language.


> Shouldn't basically all the code that made these terrible decisions be shared between all backends?

What makes you think this? The "middle-end" (in GCC-lingo) is where architecture-specific optimisations happen and that needs to be modified on a per-architecture basis. As an obvious example, not all CPUs have vector instructions, yet auto-vectorisation is part of the optimisation passes for architectures that support it.

There are several ways of doing this, but the assembly emitter (i.e. the "backend") is not the part concerned with that at all.


Because the issues aren't codegen; they're weird, broken optimisations: loading values several times, subtracting zero.

Why would gcc have several register allocators instead of one slightly more configurable one? Why aren't useless actions (subtract by zero) removed in generic optimisation?

Of course there are architecture specific optimisations, but the examples provided here are not of that kind


gcc is a complex beast and optimisation is hard. There's no such thing as "generic optimisations" that remove useless actions, because depending on the hardware, certain seemingly useless actions are actually required to make correctness guarantees. Without access to the offending source code, one can only speculate about what actually led to this result.

Another problem is simply that there are probably only a handful of people, likely employed by one or two companies, who maintain the parts for this specific architecture. It's very likely that there are too few users who actually report these kinds of issues and too few resources available to test and address them.

It's not an issue with gcc or its optimiser in general.


Ah. I thought GCC was more similar to LLVM in that aspect, which definitely has a generic optimiser, though different users of the library enable different ones


> Most people believe that modern compilers generate better-optimized assembly code than humans, but look at this example from AVR-GCC 5.4.0 with -O2 optimization level

Modern compilers? GCC 5.1.0 was released in 2015, and 5.4.0 is a bugfix release. Why would someone blog about its poor code generation without bothering to try the latest release?

Besides, the AVR is an exceptional architecture: an 8-bit RISC with sizeof(int) == sizeof(short) == 2. Neither GCC nor C were designed for 8-bit processors.


People were saying "modern compilers generate better-optimized assembly code than humans" in 2015 too, and well before that. It's not that long ago.

5.4.0 is the version that ships with Debian and Ubuntu, so I guess that's why this particular version was used.


What do you mean by "5.4 is the version that ships with Ubuntu"?

5.4 was the default compiler on Ubuntu 16.04, which is end of life.

Ubuntu 18.04: 7.4
Ubuntu 20.04: 9.3
Ubuntu 22.04: 11.2


gcc-avr is a different package from the regular gcc:

https://packages.debian.org/sid/gcc-avr

https://packages.ubuntu.com/search?suite=kinetic&searchon=na...

I don't know why it's an older version; you'll have to ask the Debian people. Maybe there's a reason, but most likely it's just that gcc tends to be a difficult piece of software to package and not enough people care/have time to keep the avr version up to date.


I actually have no idea who maintains AVR-GCC, but this seems like an excellent bug report that could improve the performance of a very important tool chain!


An excellent bug report for a compiler would

* have reduced source code for the example, preferably a single source file, in a form which can be compiled

* have the exact command line used to build it

* have the exact compiler version being used

* apply to a supported version of the compiler (GCC 5 is from 2015...)

* include an accurate explanation of what is wrong with the generated code

Maybe the last one is covered by this blog post, but things like

> I can maybe understand the subtraction of constant 0, if there’s another code path that jumps to 7ba6

suggest that the author hasn't fully analysed the generated code.

The analysis is OK though and with source code it would be a pretty good bug report.


5.4.0 is ancient I think.


Amazing how in the world of software, 6 years is considered "ancient".

I know the field is a fast moving one, but my goodness. The target architecture is used in the world of embedded hardware where product lifetimes often can be measured in decades (think locomotives, industrial installations, ships, etc.)

For compliance and certification reasons alone you're often stuck with using older software versions. 6 years isn't actually that bad in that context. Keep in mind that many devices, such as ATMs, still use OS/2(!) or Windows 2000(!) as their operating systems, for example.


Well, and it's not like WG14 has been doing anything constructive for the last 20 years, so it's not like there are any new features to take advantage of.

Far as I can tell the only thing I get out of newer versions of gcc is more pointless warnings to silence.


Very good point.

Although my meaning was more that in the context of opening an issue on a project, this version is ancient.


I found the default jump table implementation to be quite costly when using something with a very small flash size such as the attiny10. Luckily gcc now has goto label support for inline assembly.

https://gist.github.com/russdill/abbacfe5dba1ba4070584efe44f...
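For illustration, a hedged sketch (not the code from the gist): "asm goto" lets inline assembly branch straight to C labels, which makes a hand-rolled dispatch possible. A two-way example; CPI needs an upper register, hence the "d" constraint.

    #include <stdint.h>

    static void dispatch(uint8_t op)
    {
        asm goto("cpi %[op], 1  \n\t"   /* compare op with 1 */
                 "breq %l[case_one]"    /* branch to the C label */
                 : /* asm goto allows no outputs */
                 : [op] "d"(op)
                 : "cc"
                 : case_one);
        /* fall-through: op != 1 */
        return;
    case_one:
        /* op == 1 */
        return;
    }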


Well, still much better than sdcc, which often fails to optimize array accesses on certain targets.

And just recently I took avr-gcc down the -O3 -flto rabbit hole, and everything survived; with the huge size win, I could add plenty of new functionality and checks.


Excuse me if this is a dumb question, but is there something like Godbolt for AVR compilers?


Godbolt* does support AVR gcc at least.

* actually "compiler explorer"


Cheers



