Over the past few months, there have been quite a few new developments in the field of embodied intelligence. With terms like VLA, VLM, LLM, and vision-based foundation models swirling around, it's time to clear things up.

We've updated the tutorial 'Foundation Models Meet Embodied Agents' at ICCV 2025, incorporating the latest progress and design ideas. The core framework is the Markov Decision Process (MDP), which we use to categorize what foundation models can do and how they should be applied in embodied agents.

![](https://pbs.twimg.com/media/G3ueKgWX0AA7JBg?format=jpg&name=large)

The MDP is an old concept, but it takes on new meaning when foundation models are combined with embodied intelligence. We group the uses of foundation models by the capabilities an MDP requires: not by model type, but by the role each model plays in the agent's decision-making loop.

![](https://pbs.twimg.com/media/G3ufdGwXcAAsfat?format=jpg&name=large)

![](https://pbs.twimg.com/media/G3ufeoUXgAEPMnH?format=jpg&name=large)

For example, some models parse goals (Goal g) from human instructions, others generate actions (Action a_t) from the current state (State S_t), some directly learn reward functions (Reward r_{t+1}), and some predict the next state (State S_{t+1}). The accompanying images break down each role in detail.

![](https://pbs.twimg.com/media/G3ufnlzXYAE9pis?format=jpg&name=large)

![](https://pbs.twimg.com/media/G3ufqs9XwAAnSJN?format=jpg&name=large)

The advantage of this approach is that no matter how foundation models evolve (language or vision, generative or discriminative), you can always slot them into some part of the MDP and check whether they fit, rather than letting the models dictate the framework.
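To make the role-based view concrete, here is a minimal Python sketch of an MDP loop with four pluggable slots: goal parsing, action generation, reward, and next-state prediction. It is not code from the tutorial. The names (`MDPRoles`, `rollout`, and the `toy_*` functions) are hypothetical, the world is a toy 1-D line, and each stub stands in for where a foundation model could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Toy types (assumptions for illustration only).
State = int    # position on a 1-D line
Action = int   # -1 (left), 0 (stay), +1 (right)
Goal = int     # target position


@dataclass
class MDPRoles:
    """Hypothetical container: one slot per role a model can play in the MDP."""
    parse_goal: Callable[[str], Goal]                      # instruction -> g
    policy: Callable[[State, Goal], Action]                # (S_t, g) -> a_t
    reward: Callable[[State, Action, State, Goal], float]  # -> r_{t+1}
    transition: Callable[[State, Action], State]           # (S_t, a_t) -> S_{t+1}


def rollout(roles: MDPRoles, instruction: str, s0: State,
            max_steps: int = 10) -> List[Tuple[State, Action, float, State]]:
    """Run the decision loop, calling each role where the MDP needs it."""
    g = roles.parse_goal(instruction)
    s = s0
    trajectory = []
    for _ in range(max_steps):
        a = roles.policy(s, g)
        s_next = roles.transition(s, a)
        r = roles.reward(s, a, s_next, g)
        trajectory.append((s, a, r, s_next))
        s = s_next
        if s == g:  # stop once the goal state is reached
            break
    return trajectory


# Stubs standing in for foundation models (hypothetical, deliberately trivial).
def toy_parse_goal(instruction: str) -> Goal:
    # Reads the last whitespace-separated token as the target position.
    return int(instruction.split()[-1])


def toy_policy(s: State, g: Goal) -> Action:
    # Step one unit toward the goal: +1, -1, or 0 if already there.
    return (g > s) - (g < s)


def toy_reward(s: State, a: Action, s_next: State, g: Goal) -> float:
    # Sparse reward: 1.0 only when the next state is the goal.
    return 1.0 if s_next == g else 0.0


def toy_transition(s: State, a: Action) -> State:
    # Deterministic dynamics on the line.
    return s + a


roles = MDPRoles(toy_parse_goal, toy_policy, toy_reward, toy_transition)
trajectory = rollout(roles, "move to position 3", s0=0)
```

Under these assumptions, swapping any one stub for a real model leaves the loop unchanged, which is the point of classifying models by role rather than by type.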

The tutorial also discusses specific cases, such as how large models understand open-ended instructions, how multimodal models process visual signals from the physical world, and how to use models to reduce reliance on manual reward design.

Many people at the event asked whether there would be a recording. There are currently no plans for a public recording, but the tutorial slides are already available on the website if you'd like to look through them.

- Time: 1 PM - 5 PM (HST) on October 20

- Location: Room 306B, Hawaii Convention Center

- Slides and Materials: [https://foundation-models-meet-embodied-agents.github.io](https://foundation-models-meet-embodied-agents.github.io)

The tutorial was prepared together with Yunzhu Li (Columbia), Jiayuan Mao (MIT), and Wenlong Huang (Stanford). If you're also researching how to ground foundation models in the physical world, you might find useful references here.

ICCV 2025 Tutorial Update: How Embodied Agents Can Leverage Foundation Models