Wink Pings

Azure Foundry Adds 7 New Models Including GPT-5.5: Open-Source Tool Solves the Common Problem of Unclear ROI for Multi-Model Setups

On May 19, 2026, Microsoft Azure announced two updates to its Foundry Model Router: it added 7 cutting-edge large models, including GPT-5.5, Claude-Opus-4.7, and grok-4.1-fast-reasoning, and released an open-source automated evaluation repository alongside them. The tool calculates the actual quality, cost, and latency impact of model routing in a single run, addressing a widespread pain point in deploying multi-model and agent applications: return on investment that cannot be quantified.

On May 19, 2026, Microsoft Azure officially launched two updates for Foundry Model Router.

The first update is an expansion of the model catalog, adding 7 cutting-edge large models: gpt-5.4, gpt-5.4-mini, gpt-5.4-nano, gpt-5.3-chat, Claude-Opus-4.7, gpt-5.5, and grok-4.1-fast-reasoning. Including previously supported models, Foundry Model Router now integrates a total of 28 leading large models. Developers do not need to integrate APIs from multiple providers separately; the router automatically selects the most suitable model in real time based on the complexity of the prompt, inference requirements, and task type.

The second update is the launch of an open-source automated evaluation repository purpose-built for model routing.

Just four days before this update, Gerard Sans from Axiom pointed out on social media that for many enterprises, heavy use of agentic AI has not delivered the expected returns. The core issues include unreliable outputs that require extensive manual correction, frequent loops of repeated work, token usage that typically runs 5 to 20 times over expectations, and rising fixed costs for monitoring, failure recovery, and human oversight. The longer a system runs, the more it tends to cost, without a matching increase in output value. Many enterprises have begun auditing their spending and found that the ROI of their AI investments does not add up.

The growing adoption of multi-model routing adds new variables to the cost calculation. With a single fixed model, cost and quality were easy to measure. After switching to automatic routing, no existing tool could clearly answer these practical questions:

- Does the model automatically selected by the router deliver comparable quality to the fixed model previously used, on your business-specific prompts?

- The router charges a service fee for processing input prompts, plus the cost of the underlying models. Does this actually save money overall, or just shift costs around?

- Does the time taken for the router's own decision-making, combined with the response time of the selected model, completely offset the speed advantage of small models?

- If you can only use a specific subset of models to meet compliance requirements, how much of a trade-off will you have to make in terms of quality and cost?
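The cost question above comes down to simple arithmetic once prices are known. The following is a minimal sketch, assuming the router fee is charged per input token on top of the underlying model's own price, as the article describes. All prices, the `Price` class, and the function names are hypothetical placeholders, not real Azure pricing or part of the evaluation repository.

```python
from dataclasses import dataclass


@dataclass
class Price:
    """USD per one million tokens. All figures used below are made up."""
    input_per_m: float
    output_per_m: float


def call_cost(price: Price, in_tokens: int, out_tokens: int) -> float:
    """Cost of one call to an underlying model."""
    return (in_tokens * price.input_per_m + out_tokens * price.output_per_m) / 1_000_000


def routed_cost(price: Price, in_tokens: int, out_tokens: int,
                router_fee_per_m_input: float) -> float:
    """Underlying model cost plus the router's fee on the input prompt."""
    router_fee = in_tokens * router_fee_per_m_input / 1_000_000
    return call_cost(price, in_tokens, out_tokens) + router_fee


# Hypothetical prices, not real Azure pricing.
BASELINE = Price(input_per_m=2.00, output_per_m=8.00)
SMALL = Price(input_per_m=0.20, output_per_m=0.80)
ROUTER_FEE_PER_M_INPUT = 0.10

# One request: 1,000 input tokens and 500 output tokens.
fixed = call_cost(BASELINE, 1_000, 500)
routed = routed_cost(SMALL, 1_000, 500, ROUTER_FEE_PER_M_INPUT)
print(f"fixed baseline:        ${fixed:.6f}")
print(f"routed to small model: ${routed:.6f}")
print(f"saving per request:    ${fixed - routed:.6f}")
```

Whether routing saves money in practice depends on how often the router picks a cheaper model than the baseline, which is why the per-request model distribution matters.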

The open-source evaluation tool released by Azure directly addresses all of these problems. It can run locally without requiring access to enterprise-level Foundry services, and its core capabilities include:

- Outputs three core metrics (quality, cost, and latency) in a single run, eliminating the need for separate calculations

- Cost calculation automatically includes the router's input prompt markup and the pricing of the actually called underlying models, so no hidden costs are missed

- Uses a paired-sequence LLM-as-a-judge scoring mechanism to eliminate position bias, resulting in more reliable scoring

- Directly generates two composite metrics: quality per dollar and quality per second, making trade-offs between different solutions immediately clear

- Outputs the distribution of underlying models actually called for each request, helping developers verify the actual performance of the balance, cost, and quality routing modes, and check whether the model subset configuration matches expectations

- Optional synchronization of results to Foundry's enterprise toolchain, making it directly usable for enterprises with compliance and governance requirements
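One common way to remove position bias from LLM-as-a-judge scoring is to judge each pair in both orders and average the two results. The sketch below illustrates that idea only; it is not taken from the repository, and the `Judge` interface, `paired_preference`, and the stub judge are all hypothetical. The stub stands in for a real model call and deliberately favors whichever answer is shown first.

```python
from typing import Callable

# A judge takes (prompt, first_answer, second_answer) and returns a score in
# [0, 1] for how strongly it prefers the first answer. Hypothetical interface.
Judge = Callable[[str, str, str], float]


def paired_preference(judge: Judge, prompt: str, routed: str, baseline: str) -> float:
    """Judge both orderings and average, so a fixed positional bonus cancels out."""
    routed_first = judge(prompt, routed, baseline)          # routed answer in slot one
    routed_second = 1.0 - judge(prompt, baseline, routed)   # routed answer in slot two
    return (routed_first + routed_second) / 2.0


def biased_stub_judge(prompt: str, first: str, second: str) -> float:
    """Stand-in for an LLM call: prefers the longer answer, plus a bonus for slot one."""
    if len(first) == len(second):
        base = 0.5
    elif len(first) > len(second):
        base = 0.7
    else:
        base = 0.3
    return min(1.0, base + 0.2)  # +0.2 simulates position bias toward the first slot


# Equal-length answers: a single ordering is skewed, the paired score is neutral.
single = biased_stub_judge("q", "abcd", "wxyz")
equal = paired_preference(biased_stub_judge, "q", "abcd", "wxyz")
longer = paired_preference(biased_stub_judge, "q", "a longer answer", "short")
print(round(single, 3), round(equal, 3), round(longer, 3))
```

A score above 0.5 means the judge prefers the routed answer after both orderings are taken into account.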

### Notes for Usage

The official documentation explicitly states the following prerequisites and limitations, which help users avoid common pitfalls:

1. The effective context window of the router equals that of the smallest model among all integrated models. If a prompt longer than this is routed to a small model, the call fails with an error

2. Claude series models require separate pre-deployment by developers; the router will not automatically create Claude deployment instances

3. Currently, routing decisions are only made based on text content. Input images do not affect routing results, and audio input is not supported

4. Model routing is currently only available in two regions: East US 2 and Central Sweden, with deployment types Global Standard and Data Zone Standard
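The context window limit in note 1 can be checked before an evaluation run. The following is a rough pre-flight sketch, assuming you know the context window of each model in your routing pool. The model names, window sizes, and the four-characters-per-token estimate are all illustrative assumptions; a real check should use the window sizes from the official documentation and a proper tokenizer.

```python
def effective_context_window(context_windows: dict) -> int:
    """The router's usable window is that of the smallest model in the pool."""
    return min(context_windows.values())


def rough_token_count(text: str) -> int:
    """Crude estimate (about 4 characters per token); a real tokenizer will differ."""
    return max(1, len(text) // 4)


def oversized_prompt_ids(prompts: list, context_windows: dict) -> list:
    """Return the ids of prompts whose estimated length exceeds the effective window."""
    limit = effective_context_window(context_windows)
    return [p["id"] for p in prompts if rough_token_count(p["prompt"]) > limit]


# Hypothetical model names and window sizes, kept small for readability.
WINDOWS = {"large-model": 1_000, "small-model": 100}

PROMPTS = [
    {"id": "short", "prompt": "Summarize this paragraph."},
    {"id": "long", "prompt": "word " * 200},  # 1,000 characters, about 250 tokens
]

print(effective_context_window(WINDOWS))
print(oversized_prompt_ids(PROMPTS, WINDOWS))
```

Prompts flagged this way can be shortened or excluded before they reach the router.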

### Quick Start Steps

The evaluation tool requires Python 3.9 or higher. You can see a demo of the results without an API key by running these commands:

```bash
# Run demo on macOS / Linux
bash scripts/demo.sh

# Run demo on Windows
.\scripts\demo.ps1
```

The demo runs on simulated data and displays a complete results dashboard. No API calls are required.

Running a full evaluation on your own business data only takes 7 steps:

1. Clone the repository and install dependencies

```bash
git clone https://github.com/microsoft/foundry-model-router-autoeval.git
cd foundry-model-router-autoeval
pip install -e ".[dev]"
```

2. Copy the .env example file and fill in three sets of credentials: the endpoint and key for model routing, deployment information for the baseline model (such as a fixed GPT-5 deployment) for comparison, and deployment information for the judge model used for scoring

3. Adjust the configuration file. Three pre-built configuration templates are available: quick test, large-scale test, and Foundry integration

4. Import your own business prompts. Supported formats include JSONL, CSV, and SQL databases. Only `id` and `prompt` are required fields

5. Run the evaluation. It supports dry runs (checking the configuration without calling APIs) and resuming an interrupted run from where it stopped. For large-scale tests, refer to the official scaling documentation to adjust the concurrency limit and avoid rate limits

6. View the results. Outputs include a standalone HTML dashboard with 8 built-in charts, a markdown summary, machine-readable JSON results, and a detailed CSV record for each prompt including the actual model called for each individual request

7. Optional operations: you can compare results from multiple runs (for example, the difference between balance mode and cost mode), or synchronize results to the Foundry enterprise evaluation platform
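For step 4, a prompt file in JSONL format is one JSON object per line. The sketch below writes such a file and checks that each row carries the two required fields, `id` and `prompt`. The helper names and the file name `prompts.jsonl` are hypothetical; where the repository expects the file to live is not covered here, so consult its documentation for the actual path and any optional fields.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("id", "prompt")  # the only required fields, per the article


def write_prompts_jsonl(rows: list, path: Path) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def validate_prompts_jsonl(path: Path) -> list:
    """Return a list of human-readable problems; an empty list means the file looks fine."""
    problems = []
    with path.open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(row, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            for field in REQUIRED_FIELDS:
                if not row.get(field):
                    problems.append(f"line {line_no}: missing or empty '{field}'")
    return problems


rows = [
    {"id": "ticket-001", "prompt": "Summarize this support ticket in two sentences."},
    {"id": "ticket-002", "prompt": "Classify the sentiment of this customer review."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "prompts.jsonl"
    write_prompts_jsonl(rows, path)
    problems = validate_prompts_jsonl(path)

print(problems)
```

Running a check like this before a dry run catches malformed rows without spending any API calls.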

### Related Links

- [Open-source evaluation repository](https://aka.ms/modelrouter/evaluations)

- [Foundry Model Router official documentation](https://learn.microsoft.com/azure/ai-foundry/concepts/model-router)

After the announcement, one developer commented that AI companies are currently competing on the intelligence of individual models, while cloud providers are building the operating system for all intelligence, and that these are two entirely different tracks of competition. Another user was curious about the newly added grok-4.1-fast-reasoning and asked how the model actually performs.

Published: 2026-05-20 06:59