*For one TensorFlow is not a generic framework like cuda is, so you lose a whole...

shaklee3 · on Sept 14, 2020

I gave you an example of something you can't do, which is an overlap-save FFT, and you ignored that completely. Please implement it, or show me any example of someone implementing any custom FFT that's not a simple, standard, batched FFT. I'll take any example of implementing any type of signal processing pipeline on TPU, such as a 5G radio.

Your last sentence is pretty funny: a GPU supposedly can't do certain workloads because one that it can do is too slow for you. Yet it remains a fact that TPU cannot do certain workloads without offloading to the CPU (making it orders of magnitude slower), and that's somehow okay? Where this discussion seems to be going is that you pointed to a TensorFlow library that may or may not offload to a TPU, and it probably doesn't. But even that library isn't complete enough to implement things like a 5G LDPC decoder.

sillysaurusx · on Sept 14, 2020

Which part of this can't be done on TPUs? https://en.wikipedia.org/wiki/Overlap%E2%80%93save_method#Ps... As far as I can tell, all of those operations can be done on TPUs. In fact, I linked to the operation list that shows they can be.

You'll need to link me to some specific implementation that you want me to port over, not just namedrop some random algorithm. Got a link to a github?

If your point is "There isn't a preexisting operation for overlap-save FFT" then... yes, sure, that's true. There's also not a preexisting operation for any of the hundreds of other algorithms that you'd like to do with signal processing. But they can all be implemented efficiently.
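As a rough illustration of which operations overlap-save actually needs (block slicing, a forward FFT, a pointwise multiply, an inverse FFT, and discarding the wrapped-around samples), here is a minimal host-side sketch in plain Python. It is not TPU code and says nothing about TPU op support or performance. The names `fft`, `overlap_save`, `direct_convolve`, and the `n_fft` parameter are hypothetical, and the small radix-2 FFT is only a stand-in so the example has no dependencies.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def overlap_save(x, h, n_fft=16):
    """Linear convolution of real x with real FIR filter h via overlap-save."""
    m = len(h)
    assert n_fft & (n_fft - 1) == 0, "n_fft must be a power of two"
    step = n_fft - (m - 1)  # new output samples produced per block
    assert step > 0, "n_fft must be at least len(h)"
    h_freq = fft(list(h) + [0.0] * (n_fft - m))  # filter spectrum, computed once
    out_len = len(x) + m - 1
    n_blocks = -(-out_len // step)  # ceiling division
    # m-1 zeros of "history" in front, zero tail so the last block is full.
    padded = [0.0] * (m - 1) + list(x) + [0.0] * (n_blocks * step - len(x))
    y = []
    for b in range(n_blocks):
        block = padded[b * step : b * step + n_fft]  # overlaps previous block by m-1
        prod = [a * c for a, c in zip(fft(block), h_freq)]
        circ = fft(prod, inverse=True)
        # The first m-1 samples are corrupted by circular wrap-around: drop them.
        y.extend(v.real / n_fft for v in circ[m - 1 :])
    return y[:out_len]


def direct_convolve(x, h):
    """Reference time-domain convolution, used only to check the result."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y
```

Comparing `overlap_save(x, h)` against `direct_convolve(x, h)` on a short signal is a quick sanity check. Whether each step maps onto an efficient accelerator kernel is a separate question that this sketch does not answer.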

> Yet it remains a fact that TPU cannot do certain workloads without offloading to the CPU (making it orders of magnitude slower), and that's somehow okay?

I think this is the crux of the issue: you're saying X can't be done, I'm saying X can be done, so please link to a specific code example. Emphasis on "specific" and "code".

shaklee3 · on Sept 14, 2020

Let's just leave this one alone then. I can't argue with someone who claims anything is possible, yet absolutely nobody seems to be doing what you're referring to (except you). A100 now tops all MLPerf benchmarks, and the unavailable TPUv4 may not even keep up.

Trust me, I would love it if TPUs could do what you're saying, but they simply can't. There's no direct DMA from the NIC to the TPU, so I can't run a streaming application at 40+ Gbps on it. Even if TPU could do all the things you claim, if it's not as fast as the A100, what's the point? To go through undocumented pain to prove something?

sillysaurusx · on Sept 14, 2020

FWIW, you can stream at 10Gbps to TPUs. (I've done it.)

10Gbps isn't quite 40Gbps, but I think you can get there by streaming to a few different TPUs on different VPC networks. Or to the same TPU from different VMs, possibly.

The point is that there's a realistic alternative to nVidia's monopoly.

shaklee3 · on Sept 14, 2020

When I can run a TPU in my own data center, there is. Until then it precludes a lot of applications.