For one, TensorFlow is not a generic framework like CUDA is, so you lose a whole bunch of the configurability you have with CUDA.
Why make generalizations like this? It's not true, and we've devolved back into the "nu uh" we originally started with.
This is trivial to do on a GPU, and is built into the library
Yes, I'm sure there are hardwired operations that are trivial to do on GPUs. That's not exactly a +1 in favor of generic programmability. There are also operations that are trivial to do on TPUs, such as CrossReplicaSum across a massive cluster of cores, or the various special-case Adam operations. This doesn't seem related to the claim that TPUs are less flexible.
The raw functions it provides are not direct access to the hardware and memory subsystem.
Jax is also going to be giving even lower-level access than TF, which may interest you.
You did not give an example of something GPUs can't do. All you said was that TPUs are faster for a specific function in your case.
Well yeah, I care about achieving goals in my specific case, as you do yours. And simply getting together a VM that can feed 500 examples/sec to a set of GPUs is a massive undertaking in and of itself. TPUs make it more or less "easy" in comparison. (I won't say effortless, since it does take some effort to get yourself into the TPU programming mindset.)
I gave you an example of something you can't do, which is an overlap-save FFT, and you ignored that completely. Please implement it, or show me any example of someone implementing any custom FFT that's not a simple, standard, batched FFT. I'll take any example of implementing any type of signal processing pipeline on TPU, such as a 5G radio.
Your last sentence is pretty funny: a GPU can't do certain workloads because one it can do is too slow for you. Yet it remains a fact that TPU cannot do certain workloads without offloading to the CPU (making it orders of magnitude slower), and that's somehow okay? Where this discussion seems to be going is that you pointed to a TensorFlow library that may or may not offload to a TPU, and it probably doesn't. But even that library is too incomplete to implement things like a 5G LDPC decoder.
You'll need to link me to some specific implementation that you want me to port over, not just namedrop some random algorithm. Got a link to a github?
If your point is "There isn't a preexisting operation for overlap-save FFT" then... yes, sure, that's true. There's also not a preexisting operation for any of the hundreds of other algorithms that you'd like to do with signal processing. But they can all be implemented efficiently.
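For reference, here's roughly what overlap-save looks like when it's written as nothing but gathers and standard batched FFTs. This is a minimal NumPy sketch, not a TPU implementation or a benchmark: the function name `overlap_save` and the `fft_len` parameter are made up for illustration, and whether the same structure runs efficiently on a TPU is an assumption this code does not demonstrate.

```python
import numpy as np

def overlap_save(x, h, fft_len=64):
    """Linear convolution of x with FIR filter h using overlap-save.

    Hypothetical helper for illustration; only uses batched FFTs.
    """
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    m = len(h)
    step = fft_len - (m - 1)           # new input samples consumed per block
    if step <= 0:
        raise ValueError("fft_len must be larger than len(h) - 1")

    out_len = len(x) + m - 1           # length of the full linear convolution
    n_blocks = -(-out_len // step)     # ceiling division

    # Prepend m-1 zeros as history for the first block and zero-pad the
    # tail so that every block is exactly fft_len samples long.
    padded = np.zeros((m - 1) + n_blocks * step)
    padded[m - 1 : m - 1 + len(x)] = x

    # Gather overlapping blocks into a 2-D array, one block per row.
    idx = np.arange(n_blocks)[:, None] * step + np.arange(fft_len)[None, :]
    blocks = padded[idx]

    # One batched FFT, a pointwise multiply, one batched inverse FFT.
    H = np.fft.rfft(h, n=fft_len)
    y = np.fft.irfft(np.fft.rfft(blocks, axis=1) * H, n=fft_len, axis=1)

    # The first m-1 outputs of each block are corrupted by circular
    # wrap-around, so discard them and stitch the remainder together.
    return y[:, m - 1 :].reshape(-1)[:out_len]
```

Checking it against `np.convolve(x, h)` with `np.allclose` is the easy sanity test. Porting it to an accelerator would presumably mean swapping in that framework's FFT and gather ops, which I haven't verified here.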
Yet it remains a fact that TPU cannot do certain workloads without offloading to the CPU (making it orders of magnitude slower), and that's somehow okay?
I think this is the crux of the issue: you're saying X can't be done, I'm saying X can be done, so please link to a specific code example. Emphasis on "specific" and "code".
Let's just leave this one alone then. I can't argue with someone who claims anything is possible, yet absolutely nobody seems to be doing what you're referring to (except you). A100 now tops all MLPerf benchmarks, and the unavailable TPUv4 may not even keep up.
Trust me, I would love it if TPUs could do what you're saying, but they simply can't. There's no direct DMA from the NIC to the TPU, so I can't run a streaming application at 40+Gbps on it. Even if TPU could do all the things you claim, if it's not as fast as the A100, what's the point? To go through undocumented pain to prove something?
FWIW, you can stream at 10Gbps to TPUs. (I've done it.)
10Gbps isn't quite 40Gbps, but I think you can get there by streaming to a few different TPUs on different VPC networks. Or to the same TPU from different VMs, possibly.
The point is that there's a realistic alternative to nVidia's monopoly.
The raw functions it provides are not direct access to the hardware and memory subsystem.
Not true. https://www.tensorflow.org/api_docs/python/tf/raw_ops/Inplac...