
I think it's a confusion of similar terms. There's the TLB (translation lookaside buffer) of the MMU, an extremely small, fast buffer of recent MMU translations; there's no chance that's anywhere near 32MB. It's often designed to be hit within a single cycle, so it has tighter constraints than even the L1 caches of modern CPUs.

Then there's also the GPU tile buffer (which they may also be calling a "TLB" here?), which I think is what he's referring to. On a TBDR system there's an advantage if the GPU can run multiple passes on the image while keeping all the needed info in the on-chip tile buffer (think depth and color data when rendering a large scene). If the intermediate results aren't actually needed, such as a partially drawn scene where more objects will be drawn on top of some of the pixels, or an intermediate output before some post-processing pass, they may never touch main memory at all, which can save a lot of bandwidth.

I'm pretty sure it's the second the OP is referring to here: if you blow past this buffer size you can hit a performance cliff, and methods to optimise its use (such as rendering multiple passes tile by tile instead of running each pass on the full screen separately) don't really make any difference on immediate renderers, so it's something "new" developers need to take into account for tile-based deferred architectures.



I'm an undergrad and know nothing about GPUs so forgive me for these questions (I tried reading a bit just now). Where does the tile buffer come into play for GPGPU workloads?

From my simple understanding of it, TBDR GPUs extract performance by tiling and binning primitives and handing those out for rasterization. Do the multiple passes allow it to work on both at the same time, kinda like how a CPU pipelines stuff? I thought GPGPU workloads skip the rasterization stage, so what's the problem with treating it like an IMR?


A modern renderer does rasterization followed by a sequence of post-processing steps that you can think of as compute dispatches (kernels) running one thread per pixel. The reality is often somewhat more complex, but that's a good first approximation.

Those steps tend to be local, so instead of running step 1 on the whole screen, then step 2 on the whole screen, and so on, you could run all steps on the top-left tile (of e.g. 128x128 pixels), then all steps on the next tile, and so on. The downside is that you'll likely have to compute some data in the boundary regions between tiles multiple times. The upside is that the bulk of intermediate data between post-processing steps never has to be written out to and read back from memory (modern render targets are too large to fit into traditional caches, though that may be different with the huge cache AMD has built for their latest GPUs).

The same principle can be applied to GPGPU algorithms that have similar locality. This tends to be discussed under the label of "kernel fusion".
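To make the two schedules concrete, here's a toy NumPy sketch (the kernel functions and 128-pixel tile size are hypothetical stand-ins): the per-pass version touches the whole image between steps, while the per-tile version runs every step on one tile before moving on, which is what lets a tiler keep intermediates on-chip.

```python
import numpy as np

TILE = 128  # tile edge in pixels (hypothetical size)

def step1(img):
    # Stand-in for a per-pixel post-processing kernel (e.g. exposure adjust)
    return img * 1.1

def step2(img):
    # Stand-in for a second kernel (e.g. a simple tone-map)
    return img / (1.0 + img)

def per_pass(img):
    # Conventional schedule: each step reads and writes the full render
    # target, so intermediates round-trip through main memory.
    return step2(step1(img))

def per_tile(img):
    # Fused schedule: run all steps on one tile before the next, so the
    # intermediate between step1 and step2 stays tile-sized.
    out = np.empty_like(img)
    h, w = img.shape
    for y in range(0, h, TILE):
        for x in range(0, w, TILE):
            tile = img[y:y + TILE, x:x + TILE]
            out[y:y + TILE, x:x + TILE] = step2(step1(tile))
    return out
```

Note these kernels are purely point-wise, so the tiles are independent; a blur or other neighbourhood filter would need halo regions recomputed at tile borders, which is the redundant boundary work mentioned above.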


My understanding is that the rasterization process also happens per tile.

One of the cases where this can cause performance problems is when you want to read the output of a previous render pass. If you want to read arbitrary parts of that output, the buffer probably needs to be copied from tile memory into memory that can hold the whole buffer at once. Furthermore, this means that all tiles of the previous render pass need to execute before the next one can run, which limits how much work can be done in parallel.


Modern machines can do the TLB and L1 lookups in parallel. Here is how it works on a traditional CPU (it is different on M1).

The page size is 4KB, which means the lower 12 bits of an address are the same between virtual and physical addresses. The cache line is 64 bytes, so the lowest 6 bits of the address index within a cache line. The next 6 bits select one of 64 sets, and the L1 is 8-way associative, so each set holds 8 cache lines. That makes 64 * 8 cache lines of 64 bytes => 32KB of L1 cache.

The CPU does a lookup in the TLB and a lookup in the L1 in parallel: it reads the 8 cache lines of the indexed set from the L1 and filters them against the physical tag from the TLB result to hopefully get a hit.

Now you'll note that, while most CPUs have 32KB of L1, the M1 has 128KB, which means it needs 2 extra bits to match between physical and virtual addresses to pull the same trick. And what do you know, the M1 has 16KB pages! What a coincidence (not!).
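The constraint above can be checked directly: for a VIPT cache to be indexed entirely by page-offset bits (so the index is the same in virtual and physical addresses), the maximum size is page size times associativity. A quick sanity check:

```python
def max_vipt_l1(page_size, ways):
    # A VIPT cache indexed purely by page-offset bits can hold at most
    # page_size bytes per way without needing translated index bits.
    return page_size * ways

# Typical x86 core: 4KB pages, 8-way set-associative -> 32KB L1
assert max_vipt_l1(4 * 1024, 8) == 32 * 1024
# Apple M1: 16KB pages, same 8-way associativity -> 128KB L1
assert max_vipt_l1(16 * 1024, 8) == 128 * 1024
```

(The 8-way figure for M1 is an assumption here, chosen to match the 128KB claim in the comment above; the general formula holds either way.)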


The fourth tweet states that he does mean translation lookaside buffer. If that's true, it's indeed insanely big.


They also refer to it as a "Transaction Lookaside Buffer", which reinforces the impression given off by the overall twitter thread that this was secondhand information relayed by somebody who has a very weak grasp of the technical details involved.


Vadim is a smart guy but not a programmer.

So you’re right he may not mean Transaction Lookaside Buffer.


I'm pretty sure that's false, as translation lookaside buffers simply cannot be of that size while hitting the performance requirements.

I suspect they've seen someone refer to a TLB (Translation Lookaside Buffer) and TLB (TiLe Buffer) and conflated the two.


32MB may mean the total amount of memory that can be mapped at once in the graphics TLB. Main CPU usage of TLBs is likely very different from graphics usage (which is pulling all the data for a frame 60 times a second). For some workloads 32MB might be tiny and result in the TLB thrashing.


I suspect it's 32MB of coverage, which would be ~2,000 entries, a typical TLB size these days. Although then why is he talking about the tile buffer?
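The arithmetic behind that entry count is straightforward, assuming Apple's 16KB page size:

```python
coverage = 32 * 1024 * 1024   # 32MB of claimed TLB coverage
page = 16 * 1024              # Apple's 16KB page size
entries = coverage // page    # one TLB entry maps one page
assert entries == 2048        # ~2,000 entries, a plausible (L2) TLB size
```

With traditional 4KB pages the same coverage would need 8,192 entries, which is one reason larger pages make big TLB reach cheap.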


2,000 is probably the L2 TLB size; L1 TLBs are usually on the order of tens of entries (but as I pointed out, the decisions you make for graphics engines are very different from those you make for CPUs).





