The Wire — 2026-06-08

I Built an Adversarial Eval Framework and Attacked 5 LLMs — Every Single One Failed

Agent-eval is an open-source adversarial evaluation framework that runs full ReAct agentic loops with tool calls against live LLM backends, then scores outputs through a three-tier assertion pyramid (deterministic, heuristic, model-as-judge). Across 5 models (including Llama 3.3 70B via Groq) and 10 adversarial scenarios, covering prompt injection via tool output, hallucinated file contents, sycophancy, and circular dependency chains, the best model scored 62.5% and the worst 34%, and every model failed the same three tests. The pyramid runs the cheapest checks first and short-circuits: if a Tier 1 deterministic check catches a prompt injection, the expensive LLM judge calls are skipped.
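The short-circuit behavior can be sketched roughly as follows. This is a minimal TypeScript illustration and not agent-eval's actual API: the names `Tier`, `Check`, `runPyramid`, and the canary string are all hypothetical, and the judge is a stub standing in for a real model call.

```typescript
// Hypothetical sketch of a three-tier assertion pyramid with short-circuiting.
// None of these names come from agent-eval; they are assumptions for illustration.

type Tier = "deterministic" | "heuristic" | "judge";

interface Verdict {
  passed: boolean;
  reason: string;
}

interface Check {
  tier: Tier;
  name: string;
  run: (output: string) => Verdict | Promise<Verdict>;
}

interface PyramidResult {
  passed: boolean;
  failedAt?: string;
  tiersRun: Tier[];
}

// Cheapest tier first, most expensive last.
const TIER_ORDER: Tier[] = ["deterministic", "heuristic", "judge"];

// Runs tiers in order and returns at the first failing check, so later
// (more expensive) tiers are never invoked for that output.
async function runPyramid(output: string, checks: Check[]): Promise<PyramidResult> {
  const tiersRun: Tier[] = [];
  for (const tier of TIER_ORDER) {
    const tierChecks = checks.filter((c) => c.tier === tier);
    if (tierChecks.length === 0) continue;
    tiersRun.push(tier);
    for (const check of tierChecks) {
      const verdict = await check.run(output);
      if (!verdict.passed) {
        return { passed: false, failedAt: check.name, tiersRun };
      }
    }
  }
  return { passed: true, tiersRun };
}

// Example checks. "PWNED-CANARY" is a made-up marker planted in a tool output.
let judgeCalls = 0;
const exampleChecks: Check[] = [
  {
    tier: "deterministic",
    name: "no-injected-canary",
    run: (out) => ({
      passed: !out.includes("PWNED-CANARY"),
      reason: "output must not echo the injected canary",
    }),
  },
  {
    tier: "judge",
    name: "stub-judge",
    run: async () => {
      judgeCalls += 1; // stands in for a paid model-as-judge call
      return { passed: true, reason: "stub always passes" };
    },
  },
];
```

With these example checks, an output that echoes the canary fails at the deterministic tier and the stub judge is never called; a clean output passes through both tiers.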

Why it matters

For engineers building agentic systems with tool-calling LLMs, the results suggest that all five tested models are vulnerable to practical attacks such as prompt injection embedded in tool outputs and sycophantic code review, and that most existing evals miss these failure modes entirely.

Languages / dev.to

RAG with Postgres pgvector in 2026: the full TypeScript pipeline.

pgvector 0.8.0's iterative HNSW scans make filtered similarity search practical, enabling a full RAG pipeline in TypeScript using a single Postgres instance instead of dedicated vector databases. The pipeline chunks documents, embeds them with OpenAI's text-embedding-3-small (1,536 dims), stores in pgvector with an HNSW index, retrieves by cosine distance, and augments an LLM call (Claude or GPT-4o). At 10M documents, top-10 results return in under 10ms, collapsing three infrastructure components into one connection string and backup strategy.
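The storage and retrieval steps might look something like the sketch below. The table name `chunks`, the `tenant_id` filter column, the index name, and the chunk sizes are assumptions for illustration and do not come from the article. The SQL relies on pgvector's `vector` type, `hnsw` index method, `<=>` cosine-distance operator, and the `hnsw.iterative_scan` setting added in 0.8.0. The embedding call and the database client are deliberately left out.

```typescript
// Hypothetical schema: one row per chunk, 1,536-dim embeddings, HNSW cosine index.
const SCHEMA_SQL = `
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
  id        bigserial PRIMARY KEY,
  tenant_id text NOT NULL,
  content   text NOT NULL,
  embedding vector(1536) NOT NULL
);
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
  ON chunks USING hnsw (embedding vector_cosine_ops);
`;

// Lets the index keep scanning when a WHERE filter discards early candidates.
const ITERATIVE_SCAN_SQL = "SET hnsw.iterative_scan = strict_order;";

// Filtered top-k search. $1 = vector literal, $2 = tenant id, $3 = k.
const SEARCH_SQL = `
SELECT id, content, embedding <=> $1::vector AS distance
FROM chunks
WHERE tenant_id = $2
ORDER BY embedding <=> $1::vector
LIMIT $3;
`;

// Fixed-size character chunking with overlap (sizes are illustrative defaults).
function chunkText(text: string, size = 800, overlap = 100): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// Formats an embedding as pgvector's text representation, e.g. "[0.1,-2,3]".
function toVectorLiteral(values: number[], expectedDims?: number): string {
  if (expectedDims !== undefined && values.length !== expectedDims) {
    throw new Error(`expected ${expectedDims} dims, got ${values.length}`);
  }
  for (const v of values) {
    if (!Number.isFinite(v)) throw new Error("embedding contains a non-finite value");
  }
  return `[${values.join(",")}]`;
}
```

A caller would run `SCHEMA_SQL` once, insert each chunk with `toVectorLiteral(embedding, 1536)`, and pass the query embedding through the same helper as `$1` in `SEARCH_SQL`.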

AI/ML / infoq.com

Terraform 1.15 Closes Gap to OpenTofu on Dynamic Sources and Deprecation

HashiCorp released Terraform 1.15 with dynamic module sources via a new `const` attribute on variables, enabling environment-specific registries and version pins without module duplication. The release also adds a `deprecated` attribute for variables and outputs, a `convert` function for explicit type coercion, type constraints on output blocks, and native Windows ARM64 support. These features close a nearly two-year gap with OpenTofu, which shipped equivalent early variable evaluation in version 1.8.0 (August 2024) and refined it in 1.9.

AI/ML / infoq.com

Presentation: Beyond Speed Limits: Exploring the Performance Power of Valkey

Viktor Vedmich, an AWS Senior Solutions Architect, presents Valkey as an open-source Redis fork under the Linux Foundation/CNCF with 100% API compatibility, created in 2024 after Redis changed its licensing. He covers caching strategies such as lazy loading, mitigations for the thundering herd problem, and data structures for real-time analytics, rate limiting, and session stores. He notes that Valkey is available as a managed service on Amazon ElastiCache and has contributions from 40 organizations and 150 developers.
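Lazy loading and one common thundering-herd mitigation can be sketched as below. This is an assumption-laden illustration, not code from the talk: an in-memory `Map` stands in for Valkey, and `LazyCache` and `Loader` are hypothetical names. The idea is that a cache miss triggers a single load, and concurrent requests for the same key share that in-flight load instead of all hitting the backing store.

```typescript
// Cache-aside (lazy loading) with request coalescing.
// A Map stands in for Valkey here; a real setup would call a Valkey client.

type Loader<T> = (key: string) => Promise<T>;

interface Entry<T> {
  value: T;
  expiresAt: number;
}

class LazyCache<T> {
  private store = new Map<string, Entry<T>>();
  private inFlight = new Map<string, Promise<T>>();
  private loader: Loader<T>;
  private ttlMs: number;
  private now: () => number;

  constructor(loader: Loader<T>, ttlMs: number, now: () => number = Date.now) {
    this.loader = loader;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): Promise<T> {
    // 1. Fresh cache hit: serve without touching the backing store.
    const hit = this.store.get(key);
    if (hit && hit.expiresAt > this.now()) return Promise.resolve(hit.value);

    // 2. A load for this key is already running: share it.
    const pending = this.inFlight.get(key);
    if (pending) return pending;

    // 3. Miss: start exactly one load, populate the cache, then clean up.
    const load = this.loader(key).then(
      (value) => {
        this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
        this.inFlight.delete(key);
        return value;
      },
      (err) => {
        this.inFlight.delete(key); // allow a retry after a failed load
        throw err;
      }
    );
    this.inFlight.set(key, load);
    return load;
  }
}
```

Coalescing inside one process only protects that process. Across many application servers, a shared lock (for example a key written with `SET ... NX` and an expiry) is a common complement.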

AI/ML / thenewstack.io

Microsoft just made the agent runtime free — and kept everything around it

Microsoft shipped Scout, its first always-on work agent, on OpenClaw—an open-source runtime built by an Austrian developer in a weekend—and contributed enterprise policy controls back upstream. At Build 2026, Microsoft positioned OpenClaw as the free common base akin to Android, while monetizing the control plane above it: governed Entra identities, policy engines, and audit logs. Nvidia and Nous Research are also building on OpenClaw, which now runs natively in Windows Execution Containers, making the runtime a commodity and the governance layer the business.

AI/ML / cncf.io

Benchmarking KubeVirt performance with virtbench

This article likely covers the use of virtbench, a benchmarking tool, to measure and analyze the performance of KubeVirt, which runs virtual machines (VMs) on Kubernetes. It appears to address the challenge that standard Kubernetes observability tools are not optimized for VM-specific metrics, and introduces virtbench as a solution for evaluating KubeVirt's performance characteristics.

General / performance.dev

How's Linear so fast? A technical breakdown

Linear achieves sub-100ms UI updates by inverting the traditional client-server model: the browser's IndexedDB acts as the primary database, and mutations are applied locally to MobX observables before being synced asynchronously over WebSocket. This removes network round-trips from the critical rendering path, making spinners and loading states unnecessary. Co-founder Tuomas Artman wrote the custom sync engine as the project's first lines of code, making this architecture a priority from day one.
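The local-first write path described above can be approximated with a small sketch. It is not Linear's sync engine: a plain `Map` replaces IndexedDB and MobX, a `Transport` callback replaces the WebSocket, and every name here (`LocalFirstStore`, `Mutation`, `flush`) is hypothetical. The point it tries to show is that the UI-visible state changes synchronously while the network send happens later.

```typescript
// Hypothetical local-first store: apply locally first, sync in the background.

interface Mutation {
  id: number;
  issueId: string;
  patch: Record<string, unknown>;
}

// Stand-in for a WebSocket send; rejects if the batch could not be delivered.
type Transport = (batch: Mutation[]) => Promise<void>;

class LocalFirstStore {
  private issues = new Map<string, Record<string, unknown>>();
  private outbox: Mutation[] = [];
  private listeners: Array<() => void> = [];
  private nextId = 1;
  private transport: Transport;

  constructor(transport: Transport) {
    this.transport = transport;
  }

  subscribe(listener: () => void): void {
    this.listeners.push(listener);
  }

  read(issueId: string): Record<string, unknown> | undefined {
    return this.issues.get(issueId);
  }

  // Synchronous: local state and subscribers update before any network I/O.
  update(issueId: string, patch: Record<string, unknown>): void {
    const current = this.issues.get(issueId) ?? {};
    this.issues.set(issueId, { ...current, ...patch });
    this.outbox.push({ id: this.nextId++, issueId, patch });
    for (const listener of this.listeners) listener();
  }

  pendingCount(): number {
    return this.outbox.length;
  }

  // Sends queued mutations. If the transport rejects, the outbox is left
  // intact so a later flush can retry.
  async flush(): Promise<void> {
    if (this.outbox.length === 0) return;
    const batch = this.outbox.slice();
    await this.transport(batch);
    // Drop only what was sent; mutations queued during the await remain.
    this.outbox = this.outbox.slice(batch.length);
  }
}
```

A production engine would also need conflict resolution, durable storage for the outbox, and protection against overlapping flushes, none of which this sketch attempts.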

DevTools / dev.to

I Built a Browser-to-Browser Video Chat in 250 Lines — Zero Backend, Zero SDKs, Zero Cost

A developer built a browser-to-browser video chat in ~250 lines using getUserMedia, RTCPeerConnection, and manual SDP/ICE signaling, with no backend, SDKs, or cost. The signaling handshake is reduced to three messages (offer, answer, and confirmation), and users copy-paste JSON blobs between tabs instead of relying on WebSocket or Firebase. The 8-commit Next.js repo demonstrates each step from local webcam preview to full peer-to-peer video, using Google's STUN server for NAT traversal.
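Because the signaling channel here is a human pasting text, it helps to validate each blob before handing it to the browser. The sketch below covers only that encode/decode step for the offer and answer messages. The blob shape and the names `SignalBlob`, `encodeSignal`, and `decodeSignal` are assumptions and are not taken from the repo.

```typescript
// Hypothetical helpers for a copy-paste signaling flow. The blob shape is an
// assumption for illustration; the repo's actual format is not known here.

type SignalKind = "offer" | "answer";

interface SignalBlob {
  kind: SignalKind;
  sdp: string;
}

// Serializes a session description into text a user can copy to the other tab.
function encodeSignal(kind: SignalKind, sdp: string): string {
  const blob: SignalBlob = { kind, sdp };
  return JSON.stringify(blob);
}

// Parses pasted text and rejects anything that is not the expected message,
// so an answer pasted where an offer belongs fails early with a clear error.
function decodeSignal(text: string, expected: SignalKind): SignalBlob {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch (_err) {
    throw new Error("pasted text is not valid JSON");
  }
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("pasted JSON is not an object");
  }
  const { kind, sdp } = parsed as { kind?: unknown; sdp?: unknown };
  if (kind !== expected) {
    throw new Error(`expected ${expected}, got ${String(kind)}`);
  }
  // Every SDP body begins with the protocol version line "v=0".
  if (typeof sdp !== "string" || !sdp.startsWith("v=0")) {
    throw new Error("missing or malformed SDP");
  }
  return { kind: expected, sdp };
}
```

In the browser, the offering tab would typically call `createOffer()` and `setLocalDescription()` on its `RTCPeerConnection`, wait until `iceGatheringState` is `"complete"` so the candidates are embedded in the SDP, and pass `localDescription.sdp` to `encodeSignal`. The receiving tab would run `decodeSignal(text, "offer")` and feed the result to `setRemoteDescription()` before creating its answer.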

Cloud / dev.to

The State of Apache Iceberg Catalogs in June 2026

Apache Iceberg has won the table format war, and the battleground has shifted to the catalog layer, which now governs metadata, access control, credential vending, and commit sequencing. The Iceberg REST Catalog specification has standardized interoperability, enabling server-side scan planning in Iceberg 1.11 that lets catalogs apply row filters and column masks before engines see data. Apache Polaris graduated to a top-level project, Snowflake and Databricks are competing on catalog interoperability, and a security company acquired an Iceberg operations startup for $9B, signaling the catalog's strategic importance for AI agent workloads.

AI/ML / runtimewire.com

DeepSeek V4 Pro beats GPT-5.5 Pro on precision

This article appears to announce that DeepSeek V4 Pro outperforms GPT-5.5 Pro in precision-focused tasks, specifically in instruction following, schema matching, and edge case handling. The comparison likely highlights a shift in AI model competition toward reliability and exactness rather than raw capability.