Hacker News

You can run Llama-70B-based models faster than 10 tokens/s on 24 GB of VRAM. I've found that the quality of this class of LLMs is heavily swayed by your configuration and system prompt, and results may vary. This Reddit post seems to have some input on the topic:

https://www.reddit.com/r/LocalLLaMA/comments/1cj4det/llama_3...

I haven't used any agent frameworks other than messing around with LangChain a bit, so I can't speak to how that would affect things.
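A back-of-the-envelope roofline sketch may help explain why reported speeds vary so much. Token generation is roughly memory-bandwidth bound: every generated token reads all the weights once, so tokens/s is capped by bandwidth divided by model size. The bandwidth figures and bits-per-weight sizes below are my ballpark assumptions, not measurements from the thread:

```python
# Roofline estimate for token generation speed.
# Assumption: tok/s is bounded by (bytes of weights read per token) / (memory bandwidth).
# gpu_bw ~900 GB/s (high-end consumer GPU) and cpu_bw ~60 GB/s (dual-channel
# DDR5) are my ballpark figures, not benchmarks.

def tok_per_s(total_gb: float, gpu_gb: float,
              gpu_bw: float = 900.0, cpu_bw: float = 60.0) -> float:
    """Upper-bound tokens/s for weights split between VRAM and system RAM."""
    cpu_gb = max(total_gb - gpu_gb, 0.0)          # spillover to system RAM
    secs_per_tok = min(total_gb, gpu_gb) / gpu_bw + cpu_gb / cpu_bw
    return 1.0 / secs_per_tok

# Llama 3 70B: ~42 GB at ~4.8 bits/weight, ~21 GB at ~2.4 bits/weight.
print(f"70B @ ~2.4bpw, fits in 24 GB VRAM: ~{tok_per_s(21, 24):.0f} tok/s")
print(f"70B @ ~4.8bpw, spills past 24 GB:  ~{tok_per_s(42, 24):.1f} tok/s")
print(f"70B @ ~4.8bpw, CPU only:           ~{tok_per_s(42, 0):.1f} tok/s")
```

Under these assumptions, a very low-bit quant that fits entirely in 24 GB easily clears 10 tok/s, while a q4-class quant that spills into system RAM drops to single digits, which would reconcile both experiences in this thread.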



You would probably get about the same tokens per second with Llama 3 70B if you just unplugged the 24 GB GPU. For something that actually fits in 24 GB of VRAM, I recommend Gemma 2 27B up to Q6. I use Q4 and it works quite well for my needs.
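A rough VRAM-fit estimate, assuming bits-per-weight figures in the neighborhood of llama.cpp-style K-quants (~4.8 for Q4, ~6.6 for Q6) and a ~10% allowance for KV cache and activations at modest context lengths (both numbers are my assumptions):

```python
# Back-of-the-envelope VRAM footprint for a quantized model.
# BITS_PER_WEIGHT values approximate common K-quant sizes (my assumption);
# overhead=1.1 is a rough allowance for KV cache and activations.

BITS_PER_WEIGHT = {"q4": 4.8, "q6": 6.6, "f16": 16.0}

def est_vram_gb(params_b: float, quant: str, overhead: float = 1.1) -> float:
    """Approximate GiB needed for `params_b` billion weights at `quant`."""
    weight_bytes = params_b * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return weight_bytes * overhead / 2**30

for model, size in [("Gemma 2 27B", 27.0), ("Llama 3 70B", 70.0)]:
    for q in ("q4", "q6"):
        print(f"{model} @ {q}: ~{est_vram_gb(size, q):.1f} GiB")
```

With these numbers, Gemma 2 27B lands under 24 GiB at both Q4 and Q6 (Q6 only barely, which matches the "up to q6" caveat), while Llama 3 70B at Q4 is roughly double the available VRAM.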



