
I think it's a confusion of similar terms. There's the TLB (translation lookaside buffer) of the MMU, an extremely small, fast buffer of recent MMU translations; there's no chance that's anywhere near 32MB. It's often designed to be hit within a single cycle, so it has tighter constraints than even the L1 caches of modern CPUs.

Then there's also the GPU tile buffer (which they may also be calling a "TLB" here?), which I think is what he's referring to. On a TBDR system there's an advantage if the GPU can run multiple passes on the image while keeping all the needed info in the on-chip tile buffer (think depth and color data when rendering a large scene). If the intermediate results aren't actually needed, such as a partially drawn scene where more objects will be drawn on top of some of the pixels, or an intermediate output before some post-processing pass, they may never touch main memory at all, which can save a lot of bandwidth.

I'm pretty sure it's the second the OP is referring to here: if you blow past this buffer size you can hit a performance cliff, and methods to optimise its use (such as rendering multiple passes tile by tile instead of running each pass on the full screen separately) don't really make any difference on immediate renderers, so it's something "new" developers need to take into account for tile-based deferred architectures.



I'm an undergrad and know nothing about GPUs so forgive me for these questions (I tried reading a bit just now). Where does the tile buffer come into play for GPGPU workloads?

From my simple understanding of it, TBDR GPUs extract performance by tiling and binning primitives and handing those out for rasterization. Do the multiple passes allow it to work on both at the same time, kinda like how a CPU pipelines stuff? I thought GPGPU workloads skip the rasterization stage, so what's the problem with treating it like an IMR?


A modern renderer does rasterization followed by a sequence of post-processing steps that you can think of as compute dispatches (kernels) running one thread per pixel. The reality is often somewhat more complex, but that's a good first approximation.

Those steps tend to be local, so instead of running step 1 on the whole screen, then step 2 on the whole screen, and so on, you could run all steps on the top-left tile (of e.g. 128x128 pixels), then all steps on the next tile, and so on. The downside is that you'll likely have to compute some data in the boundary regions between tiles multiple times. The upside is that the bulk of intermediate data between post-processing steps never has to be written out to and read back from memory (modern render targets are too large to fit into traditional caches, though that may be different with the huge cache AMD has built for their latest GPUs).

The same principle can be applied to GPGPU algorithms that have similar locality. This tends to be discussed under the label of "kernel fusion".
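To make the two schedules concrete, here's a toy NumPy sketch (the kernel functions and 128-pixel tile size are hypothetical stand-ins): the per-pass version touches the whole image between steps, while the per-tile version runs every step on one tile before moving on, which is what lets a tiler keep intermediates on-chip.

```python
import numpy as np

TILE = 128  # tile edge in pixels (hypothetical size)

def step1(img):
    # Stand-in for a per-pixel post-processing kernel (e.g. exposure adjust)
    return img * 1.1

def step2(img):
    # Stand-in for a second kernel (e.g. a simple tone-map)
    return img / (1.0 + img)

def per_pass(img):
    # Conventional schedule: each step reads and writes the full render
    # target, so intermediates round-trip through main memory.
    return step2(step1(img))

def per_tile(img):
    # Fused schedule: run all steps on one tile before the next, so the
    # intermediate between step1 and step2 stays tile-sized.
    out = np.empty_like(img)
    h, w = img.shape
    for y in range(0, h, TILE):
        for x in range(0, w, TILE):
            tile = img[y:y + TILE, x:x + TILE]
            out[y:y + TILE, x:x + TILE] = step2(step1(tile))
    return out
```

Note these kernels are purely point-wise, so the tiles are independent; a blur or other neighbourhood filter would need halo regions recomputed at tile borders, which is the redundant boundary work mentioned above.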


My understanding is that the rasterization process also happens per tile.

One of the cases where this can cause performance problems is when you want to read the output of a previous render pass. If you want to read arbitrary parts of that output, the buffer probably needs to be copied from tile memory into memory that can hold the whole buffer at once. Furthermore, this means that all tiles of the previous render pass need to execute before the next one can run, which limits how much work can be done in parallel.


Modern machines can do the TLB and L1 lookups in parallel. Here is how it works on a traditional CPU (it is different on M1).

The page size is 4KB, which means the lower 12 bits of an address are the same between virtual and physical addresses. The cache line is 64 bytes, so the lowest 6 bits of the address index within a cache line. The next 6 bits select one of 64 sets, and the L1 is 8-way associative, so each set holds 8 cache lines. That makes 64 * 8 cache lines of 64 bytes => 32KB of L1 cache.

The CPU does a lookup in the TLB and a lookup in the L1 in parallel: it reads the 8 cache lines of the indexed set from the L1 and filters them against the physical tag from the TLB result to hopefully get a hit.

Now you'll note that, while most CPUs have 32KB of L1, the M1 has 128KB, which means it needs 2 extra bits to match between physical and virtual addresses to pull the same trick. And what do you know, the M1 has 16KB pages! What a coincidence (not!).
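The constraint above can be checked directly: for a VIPT cache to be indexed entirely by page-offset bits (so the index is the same in virtual and physical addresses), the maximum size is page size times associativity. A quick sanity check:

```python
def max_vipt_l1(page_size, ways):
    # A VIPT cache indexed purely by page-offset bits can hold at most
    # page_size bytes per way without needing translated index bits.
    return page_size * ways

# Typical x86 core: 4KB pages, 8-way set-associative -> 32KB L1
assert max_vipt_l1(4 * 1024, 8) == 32 * 1024
# Apple M1: 16KB pages, same 8-way associativity -> 128KB L1
assert max_vipt_l1(16 * 1024, 8) == 128 * 1024
```

(The 8-way figure for M1 is an assumption here, chosen to match the 128KB claim in the comment above; the general formula holds either way.)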


The fourth tweet states that he does mean translation lookaside buffer. If that's true, it's indeed insanely big.


They also refer to it as a "Transaction Lookaside Buffer", which reinforces the impression given off by the overall twitter thread that this was secondhand information relayed by somebody who has a very weak grasp of the technical details involved.


Vadim is a smart guy but not a programmer.

So you’re right he may not mean Transaction Lookaside Buffer.


I'm pretty sure that's false, as translation lookaside buffers simply cannot be of that size while hitting the performance requirements.

I suspect they've seen someone refer to a TLB (Translation Lookaside Buffer) and TLB (TiLe Buffer) and conflated the two.


32MB may mean the total amount of memory that can be mapped at once in the graphics TLB. Main CPU usage of TLBs is likely very different from graphics usage (which is pulling all the data for a frame 60 times a second). For some workloads 32MB might be tiny and result in the TLB thrashing.


I suspect it's 32MB of coverage, which would be ~2,000 entries, a typical TLB size these days. Although then why is he talking about the tile buffer?
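The arithmetic behind that entry count is straightforward, assuming Apple's 16KB page size:

```python
coverage = 32 * 1024 * 1024   # 32MB of claimed TLB coverage
page = 16 * 1024              # Apple's 16KB page size
entries = coverage // page    # one TLB entry maps one page
assert entries == 2048        # ~2,000 entries, a plausible (L2) TLB size
```

With traditional 4KB pages the same coverage would need 8,192 entries, which is one reason larger pages make big TLB reach cheap.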


2,000 is probably the L2 TLB size; L1 TLBs are usually on the order of tens of entries (but as I pointed out, the decisions you make for graphics engines are very different from those you make for CPUs).





