Intel Arc Pro B60 Benchmark Results

Feb 14, 2026


Benchmarking LLM Inference on Intel Arc Pro B60: A Comparative Analysis of vLLM

By EmbeddedLLM Team

Background

As organizations deploy large language models in production, hardware costs become a significant consideration. NVIDIA GPUs, while performant, represent a substantial investment. We were interested in evaluating whether Intel Arc Pro B60 could serve as a viable alternative for inference workloads.

The B60 offers specifications that warrant investigation:

  • 24GB VRAM - sufficient for 30B parameter models
  • 456 GB/s memory bandwidth
  • 160 XMX engines for matrix operations
  • Compatibility with established inference frameworks

This report documents our benchmarking methodology and findings.

Test Configuration

We would like to acknowledge Sparkle for providing the test hardware - a 4x Intel Arc Pro B60 server. This configuration allowed us to test with tensor parallelism (TP=4), which is representative of production deployments for models of this size.

Hardware

Component             Specification
GPU                   4x Intel Arc Pro B60
VRAM per GPU          24GB
Memory Bandwidth      456 GB/s per GPU
Tensor Parallelism    4 (across all 4 GPUs)

Software

We tested two configurations:

  • vLLM 0.12: Built from source following the official documentation, packaged as a Docker image
  • LLM-Scaler 1.2: An Intel-optimized build of vLLM, also distributed as a Docker image. This is not a different engine, but rather vLLM with Intel-specific optimizations.

Both configurations used Qwen/Qwen3-VL-30B-A3B-Instruct as the test model.
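
To make the tensor-parallel setup concrete, here is a minimal sketch of loading the test model across four GPUs with vLLM's offline Python API. It is illustrative only: the benchmarks in this report ran against the OpenAI-compatible HTTP server, and the context length and sampling settings below are placeholders rather than our exact configuration.

```python
# Minimal sketch: loading Qwen3-VL-30B-A3B with tensor parallelism across 4 GPUs.
# Illustrative only; the benchmarks in this report used the OpenAI-compatible server.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-VL-30B-A3B-Instruct",  # test model used in this report
    tensor_parallel_size=4,                  # TP=4 across the four Arc Pro B60 cards
    max_model_len=4096,                      # placeholder context budget, tune per workload
)

params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate(["Summarize the trade-offs of tensor parallelism."], params)
print(outputs[0].outputs[0].text)
```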

Workloads Evaluated

We selected workload patterns that represent common production scenarios:

Figure 1: Test scenarios (left) and B60 scaling characteristics (right)

Scenario Descriptions

Customer Support Chat (1K → 1K): Short-form conversational interactions where users ask questions and receive concise answers. Typical exchanges are 500-1500 words.

Decision Support & Action Plan (1K → 2K): Complex tasks requiring reasoning and analysis. The model receives detailed context and produces comprehensive responses including analysis, recommendations, and actionable plans.

Document Writing Assistant (1K → 2K): Long-form content generation where users provide outlines or context, and the model generates complete documents, articles, or reports.
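
Each scenario fixes an approximate input length and an output budget. As a rough illustration of how prompts of a target length can be constructed, the sketch below trims a source text to a token count using the model's tokenizer; the corpus file and helper function are placeholders for illustration, not part of our harness.

```python
# Rough sketch: build a prompt of roughly 1K tokens for the scenarios above.
# Assumes a transformers release recent enough to ship this model's tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-30B-A3B-Instruct")

def make_prompt(source_text: str, target_tokens: int = 1024) -> str:
    """Truncate source_text to approximately target_tokens tokens."""
    ids = tokenizer.encode(source_text)
    return tokenizer.decode(ids[:target_tokens])

with open("sample_corpus.txt") as f:              # placeholder corpus file
    prompt = make_prompt(f.read(), target_tokens=1024)

print(len(tokenizer.encode(prompt)))              # should land close to 1024
```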

Concurrency Levels

We tested at 16, 32, and 64 concurrent requests to understand how the system behaves under increasing load. This range covers typical deployment scenarios from small teams to moderate-scale production environments.
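
The load pattern itself is simple: fire a fixed number of requests at the server at once and wait for all of them to finish. A hedged sketch is shown below, assuming a vLLM OpenAI-compatible endpoint at localhost:8000; the prompt and token budgets are placeholders standing in for the scenario definitions above.

```python
# Sketch of one concurrency level: N simultaneous requests against the server.
# Endpoint, prompt, and token budgets are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(prompt: str, max_tokens: int) -> int:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3-VL-30B-A3B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,             # 1024 for Chat, 2048 for the 1K->2K scenarios
    )
    return resp.usage.completion_tokens

async def run_level(concurrency: int, prompt: str, max_tokens: int) -> int:
    # Issue all requests at once, mirroring the 16 / 32 / 64 levels we tested.
    counts = await asyncio.gather(
        *[one_request(prompt, max_tokens) for _ in range(concurrency)]
    )
    return sum(counts)

total = asyncio.run(run_level(16, "placeholder ~1K-token prompt", 1024))
print(f"run complete, {total} output tokens generated")
```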

Results

Figure 2: Performance comparison across metrics

Throughput Scaling

Both configurations demonstrated linear scaling from 16 to 64 concurrent requests. Throughput approximately doubled when moving from 16 to 32 requests, and doubled again from 32 to 64. This indicates the B60 can handle increased load without significant degradation.
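
For reference, the throughput figure we track is simply aggregate output tokens divided by wall-clock time for the run. The numbers in the snippet below are made-up placeholders, not measurements from this report.

```python
# Aggregate generation throughput for one benchmark run (tokens per second).
def output_throughput(output_tokens_per_request: list[int], wall_clock_seconds: float) -> float:
    return sum(output_tokens_per_request) / wall_clock_seconds

# Placeholder numbers for illustration only:
print(output_throughput([1024] * 32, 70.0))  # 32 requests of 1024 tokens in 70 s -> ~468 tok/s
```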

Time to First Token (TTFT)

TTFT measures the latency from request submission to the first token being returned. At 16 concurrent requests:

Scenario                    vLLM         LLM-Scaler
Chat (1K→1K)                2.1 seconds  4.2 seconds
Decision Support (1K→2K)    1.8 seconds  2.8 seconds

vLLM consistently shows lower initial latency. This difference is most pronounced in the Chat scenario.
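
TTFT is straightforward to measure with a streaming request: record the time from submission until the first content chunk arrives. A minimal sketch against the OpenAI-compatible endpoint follows; it is not our exact harness, and the endpoint and prompt are placeholders.

```python
# Minimal TTFT measurement: time from request submission to the first streamed token.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

def measure_ttft(prompt: str, max_tokens: int = 1024) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Qwen/Qwen3-VL-30B-A3B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        # The first chunk carrying content marks the first token.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

print(f"TTFT: {measure_ttft('placeholder prompt'):.2f} s")
```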

Time Per Output Token (TPOT)

TPOT measures the inter-token latency during generation - effectively how smoothly the output streams to the user. Lower values indicate faster per-token generation. At 16 concurrent requests:

Scenario                    vLLM          LLM-Scaler
Chat (1K→1K)                61 ms/token   48 ms/token
Decision Support (1K→2K)    59 ms/token   44 ms/token

The Intel-optimized LLM-Scaler build shows approximately 20-25% improvement in per-token generation time. For long-form content generation, this translates to noticeably faster completion times.
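
The arithmetic behind TPOT is worth spelling out: it is the time spent generating after the first token, averaged over the remaining output tokens. The sketch below back-calculates an example consistent with the table above (a 2K-token Decision Support response with a 2.8 second TTFT); the 93 second end-to-end latency is an illustrative value chosen to land near the reported 44 ms/token, not a measured number.

```python
# TPOT = (end-to-end latency - TTFT) / (output tokens - 1), reported in milliseconds.
def tpot_ms(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    if output_tokens <= 1:
        return float("nan")
    return (e2e_latency_s - ttft_s) * 1000.0 / (output_tokens - 1)

# Illustrative values: 2048 output tokens, 2.8 s TTFT, 93 s end-to-end latency.
print(f"{tpot_ms(93.0, 2.8, 2048):.1f} ms/token")  # roughly 44 ms/token
```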

Figure 3: Detailed benchmark metrics across all scenarios and concurrency levels

Figure 4: Low concurrency TPOT comparison and Chat scenario TTFT analysis

Analysis

Scaling Characteristics

The linear scaling observed up to 64 concurrent requests is noteworthy. This suggests the B60 can serve as the foundation for production deployments handling moderate concurrent load. For context, 64 concurrent users could represent:

  • A customer support team of 20-30 agents, each handling 2-3 simultaneous conversations
  • An internal AI assistant serving 200 employees with 30-40 concurrent users at peak
  • A decision support system with moderate user concurrency

Configuration Trade-offs

The choice between vLLM and the Intel-optimized build involves trade-offs:

vLLM (Standard Build)                     LLM-Scaler (Intel Optimized)
Lower TTFT                                Lower TPOT
Faster initial response                   Faster token streaming
Good for search-like interactions         Good for chat-like interactions
Example: FAQ bots, quick lookup tools     Example: Writing assistants, decision support systems

Both builds deliver comparable throughput. The decision depends on which latency metric matters more for your use case.

Figure 5: Batch vision handling comparison and when to choose which engine on Intel Arc Pro B60

Conclusions

Based on our testing, the Intel Arc Pro B60 demonstrates viable performance for inference workloads with the following characteristics:

  • Serves up to 64 concurrent requests with linear scaling
  • Achieves ~1000 tokens/s throughput for long-form generation at peak load
  • Provides sub-5 second TTFT for typical workloads
  • Benefits from Intel-specific optimizations in the LLM-Scaler build

For organizations evaluating hardware options, the B60 represents a cost-effective entry point for inference deployments. It is not a replacement for high-end NVIDIA hardware in all scenarios, but it offers a practical alternative for moderate-scale production workloads.

The availability of an Intel-optimized vLLM build is a positive development, providing measurable performance improvements without requiring changes to application code.

Acknowledgment

We would like to thank Sparkle for providing the 4x Intel Arc Pro B60 server used for this benchmarking. We also thank Chendi Xue from Intel for technical support on the vLLM XPU build. Their support made this analysis possible.

Appendix: Technical Details

For reproducibility, the complete test parameters are as follows:

Parameter             Value
Hardware Provider     Sparkle (https://www.sparkle.com.tw/)
GPU Configuration     4x Intel Arc Pro B60
VRAM                  24GB per GPU
Memory Bandwidth      456 GB/s per GPU
Tensor Parallelism    4 (TP=4 across all GPUs)
Model                 Qwen/Qwen3-VL-30B-A3B-Instruct
vLLM Build            Built from source, v0.12
LLM-Scaler Build      Intel-optimized vLLM, v1.2
Concurrency Levels    16, 32, 64
