ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs (2019) [pdf]

userbinator · on March 28, 2020

From the paper:

Our novel operations function at the circuit level by forcing the commodity DRAM chip to open multiple rows simultaneously by activating them in rapid succession.

That reminds me of this:

https://www.linusakesson.net/scene/safevsp/

It's worth reading the long explanation there, and the HN discussion at https://news.ycombinator.com/item?id=11845770 , but the critical part is this:

In short, one memory cell gets refreshed with the bit value of a different memory cell.

What I find more amazing is that this effect has probably been noticed for as long as DRAM existed and treated as a bug (like above), and it took several decades for someone to think "that could be useful!"

ChainOfFools · on March 28, 2020

I seem to recall an unusual proof of work scheme (cuckoo iirc) which depended on some similar exploit of off-label DRAM timing behavior, but just as a means of assuring a standard measuring unit rather than to perform compute ops.

tromp · on March 29, 2020

As the author of the Cuckoo Cycle PoW https://github.com/tromp/cuckoo I don't recall any such dependence. I don't see how the operations described in the paper would benefit a Cuckoo Cycle solver, which in essense looks like https://github.com/tromp/cuckoo/blob/master/doc/simplesolve

noipv4 · on March 28, 2020

Micron had implemented an Automata processor on DRAM.

https://insidebigdata.com/2019/07/26/machine-learning-with-m...

https://semiwiki.com/forum/index.php?threads/survey-paper-on...

glangdale · on March 29, 2020

This wasn't really on DRAM in the same way as the paper is, interesting though that project was. Sadly, it's more or less defunct. The academic research at UVa has largely come to an end and Micron spun out the project to a startup that seems to be moribund (http://www.naturalsemi.com/).

I tracked these guys because we were doing automata in s/w (the https://github.com/intel/hyperscan project at Intel). There were some aspects of their methodology I didn't like (notably, they tended to run benchmarks that spewed matches, slowing Hyperscan down, while turning off matches on the h/w card!) but I found them v. interesting and they had some really thought-provoking stuff.

pulse7 · on March 28, 2020

In-Memory compute would be useful for garbage collection. Compare also with this article: https://people.eecs.berkeley.edu/~krste/papers/maas-isca18-h... (note: one of the article authors is Krste Asanovic, who is on the RISC-V chairman of the board).

dang · on March 28, 2020

A single but excellent comment from last year: https://news.ycombinator.com/item?id=21992296

Someone · on March 28, 2020

That comment says

”The Elxsi 6400, built in the early 1980s could do logical operations to memory using the off the shelf drams of the day. Designer was Harold (Mac) McFarland, of PDP-11 fame.”

I googled for more info, but didn’t find info about that. There’s a Wikipedia page (https://en.wikipedia.org/wiki/Elxsi) that confirms the McFarland part, and I found the system architecture (https://amaus.net/static/S100/elxsi/systems/Elxsi%20System%2...), which is interesting (instructions for multi-precision ascii addition and subtraction, for example, I would guess for supporting COBOL), but nothing about those DRAM tricks. Does anybody have more info?

LargoLasskhyfv · on March 29, 2020

Today i learned about Elxsi 6400! What an interesting part of systems history.

etaioinshrdlu · on March 28, 2020

It seems to me like the biggest limitation is that while multiple rows can be operated on at once, there is apparently no operation that can mix the values of different columns. So bits at index 0 within a number can only be dependent on other bits at index 0.

Anyone see anything that contradicts this?

DesiLurker · on March 28, 2020

what I'd like is a programmable vector math unit with adjustable vector length so I can do massive compute operations without having to pay memory access latency cost. for instance, it would be awesome if CPU can just instruct to do alpha blend 2 frames command to the memory and avoid the whole fetch decode execute loop altogether.

On top of this it would be really awesome if this attached 'ram compute block' was re-programmable like fpgas or a simpler higher level construct. then we could essentially do all the 'parallel & dumb' computations in ram itself so CPU would mostly be for control and branch code.

jiggawatts · on March 28, 2020

With 5nm coming soon, I think this is inevitable. It's now possible to put 127M transistors in a single square millimetre, but despite billions of transistors in a CPU, total performance is not hundreds of thousands of times greater than primitive CPUs from decades ago.

It would cost next to nothing in terms of chip area to have a simple "controller" CPU in each DRAM chip that can issue vector instructions without needing to have its hand held.

The problem is the instruction set: where do you stop? Do you just have simple logic operations, or a full set of floating point operations? Do you have flow control? Stacks? Etc...

Code would either have to be written in the "full featured" language as well as a cut-down language similar to CUDA, or the embedded little chips would have to be full CPUs.

The other issue is cache coherence. It's possible to have designs where either the little DRAM CPUs participate or they don't. Both have big advantages and big disadvantages, and neither is easy.

I suspect that this is going to start turning up in GPUs before CPUs, or possibly for HPC applications before general purpose computers.

nine_k · on March 29, 2020

Since designing and perfecting new, highly parallel programming methods is hard, I can imagine stretching current approaches.

Spread ALUs and simple control units across RAM cells, there's little needed because the RAM is its own registers. Some distant big control unit will send out instructions to the processing units, and orchestrate I/O. A bit like GPU but with a different set of constraints. Likely it could be made compatible with OpenCL or CUDA.

DesiLurker · on March 29, 2020

cache coherence is a valid concern but I anticipate a clever snooping hardware could invalidate the 'touched' regions, btw this is already done for numa architecture based systems for a while.

regarding vector instruction explosion, this why I left a remark around programmable fabric (which does have to be super fast reconfigure). this way you could morph a bunch of logic blocks into whichever flavor you want. btw this is also not a first either, companies like Stretch & Mathstar have tried to do similar re-programmable fabrics & more recently altera had done re-programmable fabric using parallel/gpgpu languages like OpenCL. one good thing with programmable fabric in this context is that there is not an immense pressure to fit logic in a cycle budget because you can always claim a certain vector instruction takes X cycles to complete without effecting simpler operations taking Y (<< X) cycles.

cache coherency issues notwithstanding, you are right about it turning up in GPUs first, simply because as the target resolution scales past 4k & 8k, VR etc it would be imperative to do a lot of similar parallel operations on huge chunks of memories and memio b/w would be the biggest bottleneck there. this could mostly alleviate that.

what I am unclear about is how does putting programmable fabrics like this impacts DRAM yields?

LargoLasskhyfv · on March 29, 2020

Slightly OT, but have you seen https://news.ycombinator.com/item?id=22392160 ?

Wherein he explains something like (up to) "AVX-32768" at about 36 minutes and 30 seconds ( https://youtu.be/5Z7cmyYakAw?t=1594 ) besides the latest AVX-512 with no downclocking whatsover at 2,5GHz.

This is not exactly "In-Memory" but since it's all in the SOC, connected to ring bus with low latency, and large cache...?

All in all very impressive chip, especially when considering the technology they had made their samples in. You can't buy it yet, unfortunately.

https://centtech.com/ai-technology/

They did an AMA on reddit two months ago https://old.reddit.com/r/hardware/comments/ep0lf4/x86_cpu_de...

https://fuse.wikichip.org/news/3099/centaur-unveils-its-new-...

https://fuse.wikichip.org/news/3256/centaur-new-x86-server-p...

Cha, Cha, Cha!

etaioinshrdlu · on March 28, 2020

Interesting that they get around lack of NOT operations (and thus lack of universal logic capability) by storing each value twice: once normally, and once bitwise-negated.

gowld · on March 28, 2020

Not so much "getting around the lack of NOT" as "implementing NOT".

DonHopkins · on March 28, 2020

I'm confused: Are you using the Bill & Ted Sarcastic Postfix NOT or not?

peter_d_sherman · on March 28, 2020

Do these operations make the memory Turing-complete?

I don't have the expertise to know...

If they don't -- then what would be needed to be done to the memory controller (in this case a custom FPGA memory controller) to make it Turing-complete?

Also... even if it isn't Turing-complete, then probably whatever functionality is missing could be implemented by the FPGA -- although at the probable cost of a memory round trip for those instructions, right?

In other words, you could probably use this for mixed-mode, hybrid, Turing-completeness via additional FPGA instructions, even if the operations on RAM aren't Turing-complete in and of themselves -- or am I missing something?

All of this sounds very promising!

Great concept, great paper -- hope you get well funded for your next round of research!

jecel · on March 29, 2020

It has the functionality of the first Connection Machine, except with no network to exchange values between columns (as another comment pointed out). You can do 64K logic operations, each on one bit, at a time. You still need an external controller (which can be in the FPGA, as you suggested) to fetch instructions, do loops and things like that.

The CM-1 could do conditional execution: you tell the 64k processors to add but only those with a give flag set to 1 will actually do it while the others will execute a NOP. It is possible to simulate this on this design but it would be rather awkward, just like their 1 bit addition is awkward compared to the one clock equivalent in the CM-1.

a1369209993 · on March 29, 2020

Strictly speaking it's either turing complete already because it implements arbitrary non-acyclic boolean circuits, or it can never be turing complete because it's limited to finite memory, but two things come to mind as very useful for expanding it's practical usefulness that would probably have to done in the memory controller:

Transpose. (Take a bit-parallel array in row 0, columns 0,1,... and place it (bit-serially) in column 0, rows 0,1,... - and preferably do the same for row 1 to column 1, row 2 to column 2, etc simultaneously.)

Random access (Take a bit-serial index N in rows 1,2,... and load or store row 0 to/from row N.)

ColanR · on March 29, 2020

So is there any example code? I would love to experiment with this on an extra computer I have lying around.

magicalhippo · on March 29, 2020

You need to modify the memory controller so I'm guessing a FPGA is required, though maybe a microcontroller can be used as well. Also note that they only got the AND/OR operations to work on a couple of the eight DRAM modules tested so chances are you have to be lucky or source those parts as well.