What Happens When Multiple People Use My LLM Simultaneously?
Explores how LLM inference engines handle concurrent requests, comparing the resource allocation approaches of llama.cpp and vLLM to provide insights for family sharing scenarios.
My family members are about to head out and want to use my OpenWebUI while they're away. The server security is all set up with SSL and a custom domain. But I have a question: what happens if multiple people use the same LLM at the same time, or nearly at the same time? Does the system start a separate LLM instance for each user, or are all requests squeezed into a single instance? And is the context length shared among users?

**The key lies in your inference engine and configuration.**
**If you're using llama.cpp:**
- With the default single-slot configuration, the second request will queue up and wait for the first one to complete before being processed.
- You can configure multiple slots to support concurrency, but at the cost of dividing the context length equally. For example, a 32K context becomes 16K per request with 2 slots, or 8K with 4 slots (see the launch sketch after this list).
- This pre-allocation mechanism can lead to resource waste: a request might only use a small portion of the context, but the system has reserved the full slot capacity for it.
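To make the slot arithmetic concrete, here is a minimal launch sketch. It assumes the `llama-server` binary is on your PATH and uses a placeholder model path; the `--ctx-size` and `--parallel` flags reflect recent llama.cpp builds, so double-check them against your version.

```python
# Minimal sketch: launching llama-server with two slots.
import subprocess

TOTAL_CTX = 32768   # total context window allocated in VRAM
N_SLOTS = 2         # parallel slots; each slot gets TOTAL_CTX // N_SLOTS tokens

subprocess.run([
    "llama-server",
    "-m", "/models/your-model.gguf",   # placeholder model path
    "--ctx-size", str(TOTAL_CTX),      # total context for the whole server
    "--parallel", str(N_SLOTS),        # number of slots; context is split evenly
    "--host", "0.0.0.0",
    "--port", "8080",
])

# With 2 slots, each request is capped at 32768 // 2 = 16384 tokens,
# even when only one person happens to be chatting at that moment.
```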
**If you switch to vLLM:**
- Its KV cache is pooled, with all requests sharing the full context window (e.g., 32K).
- Each request dynamically occupies cache space as needed, without requiring pre-reservation of fixed capacity.
- This means multiple requests can truly be processed in parallel, with each request able to use nearly the full context length when needed, as long as the total usage doesn't exceed the pool size (see the sketch after this list).
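For comparison, here is a sketch of serving the same setup with vLLM's OpenAI-compatible server. It assumes vLLM is installed and uses a placeholder model name; the flags shown match current vLLM releases, but confirm them for your version.

```python
# Minimal sketch: serving a model with vLLM's OpenAI-compatible server.
import subprocess

subprocess.run([
    "vllm", "serve", "Qwen/Qwen2.5-7B-Instruct",   # placeholder model name
    "--max-model-len", "32768",           # per-request ceiling, not a per-slot split
    "--gpu-memory-utilization", "0.90",   # fraction of VRAM for weights + KV pool
    "--host", "0.0.0.0",
    "--port", "8000",
])

# All concurrent requests draw KV-cache blocks from one shared pool, so a short
# chat and a long document summary can run side by side without VRAM being
# tied up in pre-reserved slots that nobody is using.
```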
So, if it's just for occasional family use, llama.cpp with 2 concurrent slots might be sufficient. But if you want to make more efficient use of VRAM, especially when handling conversations of varying lengths, vLLM's pooled design offers more flexibility.
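If you want to see which behavior your setup actually exhibits, a quick test is to fire two requests at once and compare how long they take. The sketch below assumes an OpenAI-compatible endpoint behind OpenWebUI; the URL, API key, and model name are placeholders for your own deployment.

```python
# Quick sanity check: send two requests simultaneously and time them.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://llm.example.com/v1/chat/completions"   # placeholder endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}     # placeholder key

def ask(prompt: str) -> float:
    """Send one chat request and return the elapsed wall-clock time."""
    start = time.time()
    requests.post(URL, headers=HEADERS, json={
        "model": "your-model",   # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }, timeout=300)
    return time.time() - start

with ThreadPoolExecutor(max_workers=2) as pool:
    times = list(pool.map(ask, ["Tell me a joke.", "Summarize today's news."]))

# With a single llama.cpp slot, the second request roughly waits for the first
# to finish; with 2 slots or vLLM, the two timings should largely overlap.
print(times)
```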

Finally, don't forget to check your hardware resources, especially VRAM. Every additional concurrent request increases VRAM usage, and you don't want your family to crash the server while they're out using it.
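A simple way to keep an eye on this is to check VRAM headroom while requests are running. This sketch assumes an NVIDIA GPU and the nvidia-ml-py (pynvml) package; on other hardware, use the vendor's equivalent tooling.

```python
# Minimal sketch: print current VRAM usage on the first GPU.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

print(f"VRAM used: {mem.used / 2**30:.1f} GiB / {mem.total / 2**30:.1f} GiB")
pynvml.nvmlShutdown()
```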
Published: 2025-10-21 13:42