Our novel operations function at the circuit level by forcing the commodity DRAM chip to open multiple rows simultaneously by activating them in rapid succession.
In short, one memory cell gets refreshed with the bit value of a different memory cell.
What I find more amazing is that this effect has probably been noticed for as long as DRAM existed and treated as a bug (like above), and it took several decades for someone to think "that could be useful!"
I seem to recall an unusual proof of work scheme (cuckoo iirc) which depended on some similar exploit of off-label DRAM timing behavior, but just as a means of assuring a standard measuring unit rather than to perform compute ops.
This wasn't really on DRAM in the same way as the paper is, interesting though that project was. Sadly, it's more or less defunct. The academic research at UVa has largely come to an end and Micron spun out the project to a startup that seems to be moribund (http://www.naturalsemi.com/).
I tracked these guys because we were doing automata in s/w (the https://github.com/intel/hyperscan project at Intel). There were some aspects of their methodology I didn't like (notably, they tended to run benchmarks that spewed matches, slowing Hyperscan down, while turning off matches on the h/w card!) but I found them v. interesting and they had some really thought-provoking stuff.
”The Elxsi 6400, built in the early 1980s could do logical operations to memory using the off the shelf drams of the day. Designer was Harold (Mac) McFarland, of PDP-11 fame.”
I googled for more info, but didn’t find info about that. There’s a Wikipedia page (https://en.wikipedia.org/wiki/Elxsi) that confirms the McFarland part, and I found the system architecture (https://amaus.net/static/S100/elxsi/systems/Elxsi%20System%2...), which is interesting (instructions for multi-precision ascii addition and subtraction, for example, I would guess for supporting COBOL), but nothing about those DRAM tricks. Does anybody have more info?
It seems to me like the biggest limitation is that while multiple rows can be operated on at once, there is apparently no operation that can mix the values of different columns. So bits at index 0 within a number can only be dependent on other bits at index 0.
what I'd like is a programmable vector math unit with adjustable vector length so I can do massive compute operations without having to pay memory access latency cost. for instance, it would be awesome if CPU can just instruct to do alpha blend 2 frames command to the memory and avoid the whole fetch decode execute loop altogether.
On top of this it would be really awesome if this attached 'ram compute block' was re-programmable like fpgas or a simpler higher level construct. then we could essentially do all the 'parallel & dumb' computations in ram itself so CPU would mostly be for control and branch code.
With 5nm coming soon, I think this is inevitable. It's now possible to put 127M transistors in a single square millimetre, but despite billions of transistors in a CPU, total performance is not hundreds of thousands of times greater than primitive CPUs from decades ago.
It would cost next to nothing in terms of chip area to have a simple "controller" CPU in each DRAM chip that can issue vector instructions without needing to have its hand held.
The problem is the instruction set: where do you stop? Do you just have simple logic operations, or a full set of floating point operations? Do you have flow control? Stacks? Etc...
Code would either have to be written in the "full featured" language as well as a cut-down language similar to CUDA, or the embedded little chips would have to be full CPUs.
The other issue is cache coherence. It's possible to have designs where either the little DRAM CPUs participate or they don't. Both have big advantages and big disadvantages, and neither is easy.
I suspect that this is going to start turning up in GPUs before CPUs, or possibly for HPC applications before general purpose computers.
Since designing and perfecting new, highly parallel programming methods is hard, I can imagine stretching current approaches.
Spread ALUs and simple control units across RAM cells, there's little needed because the RAM is its own registers. Some distant big control unit will send out instructions to the processing units, and orchestrate I/O. A bit like GPU but with a different set of constraints. Likely it could be made compatible with OpenCL or CUDA.
cache coherence is a valid concern but I anticipate a clever snooping hardware could invalidate the 'touched' regions, btw this is already done for numa architecture based systems for a while.
regarding vector instruction explosion, this why I left a remark around programmable fabric (which does have to be super fast reconfigure). this way you could morph a bunch of logic blocks into whichever flavor you want. btw this is also not a first either, companies like Stretch & Mathstar have tried to do similar re-programmable fabrics & more recently altera had done re-programmable fabric using parallel/gpgpu languages like OpenCL. one good thing with programmable fabric in this context is that there is not an immense pressure to fit logic in a cycle budget because you can always claim a certain vector instruction takes X cycles to complete without effecting simpler operations taking Y (<< X) cycles.
cache coherency issues notwithstanding, you are right about it turning up in GPUs first, simply because as the target resolution scales past 4k & 8k, VR etc it would be imperative to do a lot of similar parallel operations on huge chunks of memories and memio b/w would be the biggest bottleneck there. this could mostly alleviate that.
what I am unclear about is how does putting programmable fabrics like this impacts DRAM yields?
Wherein he explains something like (up to) "AVX-32768" at about 36 minutes and 30 seconds ( https://youtu.be/5Z7cmyYakAw?t=1594 ) besides the latest AVX-512 with no downclocking whatsover at 2,5GHz.
This is not exactly "In-Memory" but since it's all in the SOC, connected to ring bus with low latency, and large cache...?
All in all very impressive chip, especially when considering the technology they had made their samples in. You can't buy it yet, unfortunately.
Interesting that they get around lack of NOT operations (and thus lack of universal logic capability) by storing each value twice: once normally, and once bitwise-negated.
Do these operations make the memory Turing-complete?
I don't have the expertise to know...
If they don't -- then what would be needed to be done to the memory controller (in this case a custom FPGA memory controller) to make it Turing-complete?
Also... even if it isn't Turing-complete, then probably whatever functionality is missing could be implemented by the FPGA -- although at the probable cost of a memory round trip for those instructions, right?
In other words, you could probably use this for mixed-mode, hybrid, Turing-completeness via additional FPGA instructions, even if the operations on RAM aren't Turing-complete in and of themselves -- or am I missing something?
All of this sounds very promising!
Great concept, great paper -- hope you get well funded for your next round of research!
It has the functionality of the first Connection Machine, except with no network to exchange values between columns (as another comment pointed out). You can do 64K logic operations, each on one bit, at a time. You still need an external controller (which can be in the FPGA, as you suggested) to fetch instructions, do loops and things like that.
The CM-1 could do conditional execution: you tell the 64k processors to add but only those with a give flag set to 1 will actually do it while the others will execute a NOP. It is possible to simulate this on this design but it would be rather awkward, just like their 1 bit addition is awkward compared to the one clock equivalent in the CM-1.
Strictly speaking it's either turing complete already because it implements arbitrary non-acyclic boolean circuits, or it can never be turing complete because it's limited to finite memory, but two things come to mind as very useful for expanding it's practical usefulness that would probably have to done in the memory controller:
Transpose. (Take a bit-parallel array in row 0, columns 0,1,... and place it (bit-serially) in column 0, rows 0,1,... - and preferably do the same for row 1 to column 1, row 2 to column 2, etc simultaneously.)
Random access (Take a bit-serial index N in rows 1,2,... and load or store row 0 to/from row N.)
You need to modify the memory controller so I'm guessing a FPGA is required, though maybe a microcontroller can be used as well. Also note that they only got the AND/OR operations to work on a couple of the eight DRAM modules tested so chances are you have to be lucky or source those parts as well.
Our novel operations function at the circuit level by forcing the commodity DRAM chip to open multiple rows simultaneously by activating them in rapid succession.
That reminds me of this:
https://www.linusakesson.net/scene/safevsp/
It's worth reading the long explanation there, and the HN discussion at https://news.ycombinator.com/item?id=11845770 , but the critical part is this:
In short, one memory cell gets refreshed with the bit value of a different memory cell.
What I find more amazing is that this effect has probably been noticed for as long as DRAM existed and treated as a bug (like above), and it took several decades for someone to think "that could be useful!"