
WEKA Benchmarks Show 10x AI Inference Efficiency on OCI


By PR Newswire | June 12, 2026

WEKA, an AI data and memory infrastructure company, has released new production-scale benchmark results demonstrating significant improvements in the economics and performance of long-context AI inference. Conducted on Oracle Cloud Infrastructure (OCI), the benchmarks highlight how WEKA’s NeuralMesh platform with Augmented Memory Grid enables organizations to serve substantially more users and tokens on the same GPU footprint without additional infrastructure.

The results were validated on a nine-node OCI bare-metal H100 cluster and focus on long-context inference workloads involving up to 100,000-token context windows, reflecting growing enterprise demand for high-throughput AI systems.

Quick Intel

  • WEKA benchmarks show 10x more concurrent users served using the same GPU infrastructure.

  • Token throughput increased 10x, reaching approximately 2 million tokens per second.

  • The system delivered 7x more tokens per GPU compared to DRAM-only configurations.

  • Testing was conducted on a nine-node OCI bare-metal H100 cluster with 72 GPUs.

  • NeuralMesh uses Augmented Memory Grid to decouple KV cache from GPU memory and improve scalability.

  • The solution targets long-context and agentic AI workloads requiring large-scale inference efficiency.

Addressing the GPU Memory Bottleneck in AI Inference

As AI workloads evolve toward longer context windows and more complex agentic workflows, infrastructure constraints around GPU memory and cache management have become a critical bottleneck.

WEKA’s benchmark results focus on this challenge, showing how inefficient memory utilization can limit concurrency, throughput, and overall system economics in production AI environments. According to the company, traditional DRAM-bound configurations struggle to maintain performance under high-load inference scenarios, particularly when context windows expand to 100,000 tokens or more.
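To see why context length strains memory, a back-of-the-envelope estimate of KV cache size helps. The sketch below is illustrative only: the model shape (80 layers, 8 key-value heads, head dimension 128, 16-bit values) and the 80 GB of memory per GPU are assumptions chosen for the example, not details of WEKA's benchmark.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Hypothetical model shape (assumption, not taken from the benchmark).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # 16-bit values

CONTEXT_TOKENS = 100_000          # context length cited in the article
GPU_MEMORY_BYTES = 80 * 10**9     # assumed 80 GB per GPU

per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
per_session = per_token * CONTEXT_TOKENS
# Upper bound: ignores the memory taken by model weights and activations.
sessions_per_gpu = GPU_MEMORY_BYTES // per_session

print(f"KV cache per token:   {per_token / 1024:.0f} KiB")
print(f"KV cache per session: {per_session / 10**9:.1f} GB")
print(f"Full-context sessions that fit in one GPU: {sessions_per_gpu}")
```

Under these assumptions a single 100,000-token session needs about 33 GB of KV cache, so only a couple of full-context sessions fit in one GPU's memory before anything has to be evicted.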

The benchmarks demonstrate how NeuralMesh with Augmented Memory Grid improves system efficiency by expanding usable memory capacity and reducing dependence on local GPU memory and DRAM.

10x Improvement in Concurrent Users and Throughput

A key finding from the benchmark is a substantial increase in system concurrency and processing capacity.

WEKA reports that the platform supported more than 5,000 concurrent users, compared to approximately 600 in DRAM-only configurations, without requiring additional infrastructure. This improvement is attributed to expanded effective cache capacity and more efficient memory utilization across the cluster.

The system also achieved approximately 10x higher token throughput, processing close to 2 million tokens per second compared to under 200,000 tokens per second in baseline configurations. This level of throughput is particularly significant for real-time AI applications such as search, summarization, coding assistance, and multi-turn agent workflows.

7x Increase in Tokens Served per GPU

In addition to concurrency and throughput gains, WEKA highlighted improvements in total token output per GPU.

During a one-hour test involving 2,400 users, the system served approximately 5 billion tokens, compared with 700 million in DRAM-only configurations. That is a roughly 7x increase in total tokens served on the same GPUs, which improves cost efficiency per token and overall GPU utilization.
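The headline multiples can be checked against the rounded figures quoted in this article. The short calculation below uses only those reported numbers; because several are stated as bounds ("more than 5,000", "under 200,000"), the computed ratios are approximate, and the variable names are ours.

```python
GPUS = 72  # nine nodes, 72 GPUs in total, as reported

def gain(new, baseline):
    """Improvement factor of the new figure over the baseline."""
    return new / baseline

# Concurrent users: "more than 5,000" vs "approximately 600".
concurrency_gain = gain(5_000, 600)
# Throughput: "close to 2 million" vs "under 200,000" tokens per second.
throughput_gain = gain(2_000_000, 200_000)
# One-hour test: ~5 billion vs 700 million tokens served.
tokens_gain = gain(5_000_000_000, 700_000_000)

# Tokens served per GPU over the one-hour test.
tokens_per_gpu_hour = 5_000_000_000 / GPUS
baseline_tokens_per_gpu_hour = 700_000_000 / GPUS

print(f"Concurrency gain: {concurrency_gain:.1f}x")
print(f"Throughput gain:  {throughput_gain:.1f}x")
print(f"Tokens gain:      {tokens_gain:.1f}x")
print(f"Tokens per GPU-hour: {tokens_per_gpu_hour:,.0f} vs {baseline_tokens_per_gpu_hour:,.0f}")
```

With these rounded inputs the concurrency ratio comes out at a little over 8x rather than 10x. Since the article gives 5,000 only as a floor, the exact counts behind the headline figure are not recoverable from the numbers published here.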

The company noted that traditional architectures often suffer from cache eviction and recomputation overheads, which degrade performance and increase operational costs at scale.
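The eviction and recomputation effect can be illustrated with a toy simulation. This is a minimal sketch, assuming a least-recently-used (LRU) eviction policy and sessions that take turns in strict rotation; neither assumption comes from WEKA, and real serving stacks are considerably more complex.

```python
from collections import OrderedDict

def simulate(num_sessions, turns_per_session, cache_capacity):
    """Count KV cache hits and recomputations under LRU eviction.

    Sessions take turns in round-robin order. A miss means the session's
    KV cache was evicted (or never built) and must be recomputed.
    """
    cache = OrderedDict()  # session id -> placeholder for its KV cache
    hits = recomputes = 0
    for _ in range(turns_per_session):
        for session in range(num_sessions):
            if session in cache:
                hits += 1
                cache.move_to_end(session)  # mark as most recently used
            else:
                recomputes += 1
                cache[session] = True
                if len(cache) > cache_capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits, recomputes

# Cache too small for the active sessions: every turn is a recompute.
small = simulate(num_sessions=10, turns_per_session=5, cache_capacity=4)
# Cache large enough for all sessions: only the first turn recomputes.
large = simulate(num_sessions=10, turns_per_session=5, cache_capacity=10)

print("small cache (hits, recomputes):", small)  # (0, 50)
print("large cache (hits, recomputes):", large)  # (40, 10)
```

In this toy setup, a cache that holds 4 of 10 active sessions never gets a hit, while one that holds all 10 recomputes each session only once. That cliff is the kind of behavior a larger effective cache is meant to avoid.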

Augmented Memory Grid Redefines Inference Architecture

At the core of WEKA’s approach is Augmented Memory Grid, a capability within the NeuralMesh platform designed to decouple key-value (KV) cache from local GPU memory.

Instead of relying solely on DRAM, the system stores KV cache in a distributed, high-performance token memory layer that can be accessed across the cluster. This allows any node to serve any session without losing context, improving load balancing and reducing inefficiencies caused by session stickiness.
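The difference between node-local and shared KV caching can be sketched in a few lines. The example below is a conceptual toy, not WEKA's API or implementation: the `Node` class, the dictionary standing in for the shared memory layer, and the token counts are all hypothetical.

```python
class Node:
    """Toy inference node that tracks how many tokens it must prefill."""

    def __init__(self, name, shared_store=None):
        self.name = name
        self.local_cache = {}             # session id -> tokens cached on this node
        self.shared_store = shared_store  # cluster-wide store, if any (hypothetical)
        self.prefill_tokens = 0

    def serve(self, session_id, context_tokens):
        store = self.shared_store if self.shared_store is not None else self.local_cache
        cached = store.get(session_id, 0)
        # Only tokens without cached KV entries need to be prefilled again.
        self.prefill_tokens += context_tokens - cached
        store[session_id] = context_tokens

def total_prefill(shared):
    """One session grows by 1,000 tokens per turn; each turn lands on a different node."""
    store = {} if shared else None
    nodes = [Node(f"node{i}", store) for i in range(3)]
    for turn, node in enumerate(nodes):
        node.serve("session-1", 1_000 * (turn + 1))
    return sum(node.prefill_tokens for node in nodes)

local_only = total_prefill(shared=False)
with_shared = total_prefill(shared=True)

print("prefill tokens, node-local caches:", local_only)   # 6000
print("prefill tokens, shared cache:     ", with_shared)  # 3000
```

With node-local caches, every node that sees the session for the first time rebuilds the whole context, which is why schedulers tend to pin sessions to nodes. With a shared store, each turn only processes the new tokens regardless of which node handles it.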

This architecture enables persistent context memory for AI agents and helps eliminate the memory bottleneck that typically constrains long-context inference workloads.

Industry Implications for AI Infrastructure Economics

WEKA positions these results as a shift in how organizations should think about AI infrastructure economics. As inference workloads scale, inefficiencies in memory management become increasingly costly, directly affecting latency, user experience, and GPU utilization.

By increasing effective memory capacity and improving cache efficiency, the company argues that organizations can significantly reduce cost per token while scaling AI services more effectively.

According to WEKA CEO Liran Zvibel:

"Inference is bottlenecked by how much effective memory is available to GPUs. These results prove that AI token economics aren't solved by hardware alone; they're solved by eliminating the memory wall that has been the real ceiling on what existing hardware can do. NeuralMesh with Augmented Memory Grid running on OCI brings orders of magnitude more tokens to customers in an extremely cost-efficient way."

Production Validation on Oracle Cloud Infrastructure

The benchmarks were conducted on a nine-node OCI bare-metal H100 cluster with 72 GPUs and validated under production-like conditions, including high concurrency and long context windows.

Oracle noted that the results demonstrate how memory optimization can remove key bottlenecks in large-scale inference systems, enabling more efficient use of GPU resources without additional hardware investment.

"This shows how WEKA's NeuralMesh platform with Augmented Memory Grid on OCI helps remove memory bottlenecks so customers can support larger, more demanding inference workloads without simply adding more GPUs," said Pablo Selem, Senior Director of Software Development at Oracle Cloud Infrastructure.

Expanding Access via Oracle Marketplace

NeuralMesh with Augmented Memory Grid is now generally available to WEKA customers and accessible through Oracle Marketplace, with OCI serving as the company’s exclusive cloud launch partner.

The solution is positioned for organizations running long-context inference workloads that require high concurrency, persistent context handling, and improved cost efficiency at scale.

About WEKA

WEKA is the AI data and memory infrastructure company transforming the economics of agentic AI. Its NeuralMesh™ platform unifies high-performance data storage with extended GPU memory, giving enterprises, AI cloud providers, and AI builders a single foundation for training, inference, and agentic workloads. With Augmented Memory Grid, NeuralMesh extends GPU memory capacity by 1000x, accelerates time to first token by up to 20x, and delivers 10x more concurrent users from the same GPU footprint, proven in production benchmarks. Trusted by 30% of the Fortune 50, WEKA enables organizations to scale AI faster, optimize GPU utilization, and reduce the cost of every token served. Learn more at www.weka.io or connect with us on LinkedIn and X.
