USENIX
Overcoming Challenges in Serving Large Language Models - SREcon23 Europe/Middle East/Africa
Explore hosting GPT-style models on Kubernetes, covering GPU sharding, tensor parallelism, and model optimization. Learn the trade-offs among latency, accuracy, and resource allocation, illustrated with a live performance demo.