Yeah, cloud DevOps and HPC are so similar and yet oh-so different (I run a smaller HPC cluster for genomics). Someone with an HPC background reading these questions would probably already know the answers (or just accept SLURM as the answer for everything).
As for the original questions: I've investigated running a full SLURM cluster in the cloud, and it has never seemed worth the effort. For clusters larger than mine, maybe, but once you hit the scale you're talking about, I just don't see the point in moving to the cloud. It would be hard to hit the sweet spot where the costs make sense. Amazon even has a series on scaling a SLURM cluster with AWS EC2 provisioning, but it seemed like more work than was justified for us.