
![vLLM TPU Preview](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FG3ZOKZcW4AAwJbh%3Fformat%3Djpg%26name%3Dlarge)

The vLLM project has released a completely redesigned TPU backend. The new version was developed in collaboration with Google, and its biggest change is that PyTorch and JAX are now unified under a single lowering path.

PyTorch models now run on TPU without any code changes, and JAX is supported natively as well. The performance gain is significant: throughput is 2 to 5 times higher than with the first TPU prototype.

From a technical perspective, there are several key improvements. The new Ragged Paged Attention v3 provides a more flexible and efficient attention kernel for TPU. The backend also defaults to SPMD (Single Program, Multiple Data) mode, the TPU's native compiler-centric programming model, which gives the compiler more room to optimize execution.
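To make the "ragged" part of the kernel's name concrete, here is a minimal pure-Python sketch of a ragged layout: variable-length sequences packed into one flat buffer with offsets instead of being padded to a common length. It only illustrates the data layout and is not the vLLM kernel; the function names `pack_ragged` and `padded_size` are hypothetical.

```python
from itertools import accumulate


def pack_ragged(seqs):
    """Pack variable-length token lists into one flat buffer plus offsets."""
    flat = [tok for seq in seqs for tok in seq]
    # offsets[i]:offsets[i + 1] delimits sequence i inside the flat buffer.
    offsets = [0] + list(accumulate(len(seq) for seq in seqs))
    return flat, offsets


def padded_size(seqs):
    """Number of slots a padded (rectangular) batch would need instead."""
    return len(seqs) * max(len(seq) for seq in seqs)


# Three requests of different lengths, as in a mixed prefill/decode batch.
seqs = [[11, 12, 13, 14, 15], [21], [31, 32]]
flat, offsets = pack_ragged(seqs)

# Recover the second sequence from the flat buffer.
second = flat[offsets[1]:offsets[2]]
```

In this toy batch the ragged buffer holds 8 tokens, while a padded batch would reserve 15 slots, so the saving grows as request lengths diverge.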

Someone asked why Keras with a JAX backend wasn't considered. From a technical standpoint, unifying the lowering path means less graph rewriting, faster compilation, and better kernel cache hit rates. vLLM has also brought paged attention and continuous batching to TPU, improving token throughput under real traffic.
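The two ideas named above, a paged KV cache and continuous batching, can be sketched with a toy scheduler. This is a simplified illustration under stated assumptions, not vLLM's implementation: `BLOCK_SIZE`, `PagedCache`, and `run` are hypothetical names, each request generates one token per step, and running out of blocks raises an error where a real engine would preempt or queue.

```python
from collections import deque

BLOCK_SIZE = 4  # tokens per KV-cache block (hypothetical value)


class PagedCache:
    """Maps each request to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))
        self.tables = {}   # request id -> list of physical block ids
        self.lengths = {}  # request id -> number of tokens stored

    def append_token(self, rid):
        n = self.lengths.get(rid, 0)
        if n % BLOCK_SIZE == 0:  # first token, or the last block is full
            if not self.free:
                raise RuntimeError("out of KV-cache blocks")
            self.tables.setdefault(rid, []).append(self.free.popleft())
        self.lengths[rid] = n + 1

    def release(self, rid):
        # Return the request's blocks to the pool for reuse.
        self.free.extend(self.tables.pop(rid, []))
        self.lengths.pop(rid, None)


def run(requests, num_blocks, max_batch):
    """requests: dict of request id -> tokens to generate.

    Returns a dict of request id -> step at which it finished.
    """
    cache = PagedCache(num_blocks)
    waiting = deque(requests.items())
    running = {}      # request id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Continuous batching: refill free batch slots before every step.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        step += 1
        for rid in list(running):
            cache.append_token(rid)
            running[rid] -= 1
            if running[rid] == 0:
                cache.release(rid)
                del running[rid]
                finished_at[rid] = step
    return finished_at


result = run({"a": 2, "b": 6, "c": 3}, num_blocks=4, max_batch=2)
```

Here request `c` takes over the slot freed by `a` and finishes at step 5. With static batching it would have to wait for `b` to finish at step 6 and would only complete at step 9.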

Regarding costs, there are no specific numbers yet for a comparison with GPUs. However, there is a real technical advance: PyTorch can leverage the XLA stack via a shared IR, while JAX retains native support.

This update has a significant impact on model deployment: no longer needing Torch XLA simplifies the deployment process. Some have questioned why Python is still being used, a choice that likely reflects a trade-off between usability and performance.

Detailed technical architecture and performance benchmarks can be found in the [vLLM blog post](https://blog.vllm.ai/2025/10/16/vllm-tpu.html).


vLLM has redesigned its TPU backend: PyTorch and JAX now share a common lowering path