The 4x comes from the neural accelerators (tensor cores in NVIDIA jargon). They're 4x fp16 over the vector path (and 8x compared to M1, because at some point they doubled the fp16 vector path). Therefore LLM prefill (context processing / TTFT), diffusion models (image generation), and e.g. video and photo effects that make use of them can be up to 4x faster.
At fp16 that's the same per-clock throughput as NVIDIA. But NVIDIA still has 2x fp8 and 4x nvfp4 on top of that.
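To make the multipliers concrete, here's a small sketch of the relative peak rates at a fixed clock, normalized to the fp16 vector path (the multipliers come from the text above; the chips and absolute rates are left abstract):

```python
# Relative peak matmul throughput at a fixed clock, normalized to the
# fp16 vector path. Multipliers are from the discussion; no absolute
# TFLOPS figures are implied.
fp16_vector = 1.0
fp16_neural = 4.0 * fp16_vector   # neural accelerators: 4x the vector path
fp16_vs_m1 = 2.0 * fp16_neural    # M1's fp16 vector path was half as wide

# NVIDIA tensor cores match the fp16 rate but add lower-precision modes:
fp8_tensor = 2.0 * fp16_neural    # 2x fp8
fp4_tensor = 4.0 * fp16_neural    # 4x nvfp4

print(fp16_neural, fp8_tensor, fp4_tensor)  # 4.0 8.0 16.0
```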
Batch-1 token generation, the number most often quoted, does not benefit from this: it's purely RAM-bandwidth-limited.
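A back-of-the-envelope sketch of why decode is bandwidth-bound (all figures here are hypothetical, chosen only for illustration): every generated token has to stream the full weight set from RAM, so tokens/sec is capped at bandwidth divided by model size in bytes.

```python
def max_decode_tokens_per_sec(bandwidth_gb_s: float,
                              params_billion: float,
                              bytes_per_param: float) -> float:
    """Upper bound on batch-1 decode speed: each token reads all weights once."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# Hypothetical example: 400 GB/s RAM, an 8B-parameter model at fp16 (2 bytes):
print(max_decode_tokens_per_sec(400, 8, 2))  # 25.0 tokens/s ceiling
```

More compute (neural accelerators, tensor cores) doesn't move this ceiling; only more bandwidth or a smaller/lower-precision model does, which is one reason fp8 and fp4 quantization help decode too.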