AI in May 2026

The 30 most significant AI stories from May 2026, ranked by signal and clustered into stories.

DwarfStar 4 Local Inference Engine Released

Salvatore Sanfilippo (creator of Redis) has released DwarfStar 4, a new local inference engine built from scratch for DeepSeek V4 Flash. It features asymmetric 2/8-bit quantization, an on-disk KV cache, an OpenAI/Anthropic-compatible API, and support for steering vectors.

r/singularity deep-dive

Nous Research Speeds LLM Pre-Training by 2.5x with Token Superposition

Nous Research has released Token Superposition Training (TST), a method claimed to speed up LLM pre-training by up to 2.5x for models ranging from 270 million to 10 billion parameters, which would cut both the cost and the time of pre-training. The release includes an arXiv paper and a dedicated project page detailing the approach.

Eric Tech technical

Graphify Optimizes Claude Code Performance with Knowledge Graphs

This video introduces "Graphify," a tool designed to optimize Claude Code's performance and reduce token usage by converting a codebase into a structured knowledge graph. The creator claims it reduced token usage by 27x on their production SaaS, preventing Claude Code from repeatedly re-reading files.

GitHub deep-dive

Ray AI compute engine accelerates ML workloads

Ray is an open-source distributed computing framework designed as an AI compute engine, providing a core distributed runtime and a suite of AI libraries. It accelerates various machine learning workloads, including model training, hyperparameter optimization, and LLM inference and serving. Ray is widely used for scaling AI applications.

GitHub technical

SGLang offers high-performance LLM/multimodal serving

SGLang is an open-source, high-performance serving framework designed for large language models and multimodal models. It aims to optimize the serving of these models, supporting various architectures like Llama and DeepSeek, and leveraging technologies like CUDA. The project has gained substantial popularity for its efficiency in model deployment.

Hugging Face Blog technical

Ettin Reranker Family Introduced for Retrieval Systems

The Ettin Reranker family is a new set of reranking models, apparently aimed at improving the relevance and quality of retrieved information. Rerankers re-score retrieved candidates to refine search results, which makes them critical components in systems like Retrieval-Augmented Generation (RAG).

r/MachineLearning deep-dive

Free 9.8M Document Indic Multilingual Corpus Released

A free, open-source Indic multilingual corpus has been released, containing approximately 9.8 million web documents across 11 languages, including Hindi, Bengali, Tamil, and Telugu. The dataset comprises about 8.4 billion tokens, is licensed under CC0, and is available on Hugging Face.

Hacker News accessible

Google Deprecates Gemini CLI, Advises Antigravity Transition

Google has announced that the Gemini CLI will be deprecated and will stop functioning from June 18, 2026. Users are advised to transition to the new Antigravity CLI, which will replace the functionality of the existing Gemini command-line interface.

r/LocalLLaMA deep-dive

llama.cpp Integrates MTP for Faster Inference

MTP (Multi-Token Prediction) speculative decoding has been integrated into mainline llama.cpp, significantly boosting inference performance. Benchmarks show Qwen3.6 27B running 2.44x faster on a Strix Halo and 2.17x faster on an RTX 3090 rig for specific quantization levels. This update was merged on May 16.
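
The draft-then-verify loop behind speculative decoding can be sketched in a few lines. This is a toy illustration, not llama.cpp's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the full model and the cheap drafter (with MTP, the drafts come from the model's own extra prediction heads), and the sketch assumes greedy decoding.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposal. A real engine scores all k
    #    positions in one batched forward pass; this loop only mimics that.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the verification pass yields a bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after the prompt using speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_tokens]


# Hypothetical toy models: the target counts upward modulo 10; the drafter
# agrees except that it guesses 0 whenever the correct token would be 5.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 5 else nxt
```

Under greedy decoding the output is identical to what the target would produce alone; the speedup comes from accepting several tokens per target pass whenever the drafter guesses correctly.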

r/StableDiffusion technical

LTX-2.3 PolarQuant Q5 Achieves 88% Size Reduction

LTX-2.3 PolarQuant Q5 is a quantized version of LTX-2.3 that achieves an 88% size reduction with near-lossless quality, reported as a cosine similarity of 0.9986. The announcement links to a GitHub repository and a Hugging Face model page.
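
For readers unfamiliar with the metric, cosine similarity measures how closely two vectors point in the same direction, with 1.0 meaning identical direction. The sketch below is only illustrative: the announcement does not say which tensors were compared, and `fake_quantize` is a hypothetical uniform rounding step, not PolarQuant.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fake_quantize(values: List[float], step: float) -> List[float]:
    """Hypothetical stand-in for a quantizer: snap each value to a grid."""
    return [round(v / step) * step for v in values]


original = [0.12, -0.53, 0.98, 0.33, -0.71]
quantized = fake_quantize(original, step=0.1)
similarity = cosine_similarity(original, quantized)
```

A value close to 1.0 indicates that quantization changed the magnitude of individual entries but barely changed the overall direction of the vector.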

r/ClaudeAI technical

Claude Code Ships "Run Until Done" Mode for Autonomous Goal Completion

Claude Code has released a new "run until done" mode, accessible via the `/goal` command in version 2.1.139. This feature allows users to define a completion condition, such as "all tests pass," and Claude Code will autonomously work across multiple turns until the goal is met.

Fahd Mirza deep-dive

Google Releases Gemma 4 MTP Drafters - Run Locally and DFlash Comparison

Google has officially released the MTP Drafter models for the Gemma 4 family. This video demonstrates running Gemma 4 31B locally on an H100 GPU with the new MTP drafter enabled and provides a comparison against DFlash. The MTP Drafters are designed to enhance inference speed.

r/LocalLLaMA deep-dive

Hugging Face Releases Carbon DNA Foundation Models

Hugging Face has released Carbon, a new family of open DNA foundation models designed to decode the "language of life." The Carbon-3B model is claimed to match the performance of Evo2-7B, the current state of the art for DNA modeling, at less than half the parameter count.

r/LocalLLaMA deep-dive

llama.cpp Integrates llama-eval for Local Model Benchmarking

A pull request to llama.cpp by ggerganov introduces `llama-eval`, a new tool that allows users to evaluate their models locally. This tool supports datasets like AIME, AIME2025, GSM8K, and GPQA, enabling comparison of quantizations and finetunes.

r/LocalLLaMA deep-dive

Transformer Language Model Runs Locally on Stock Game Boy Color

A developer successfully demonstrated running Andrej Karpathy's TinyStories-260K transformer language model locally on a stock Game Boy Color. This was achieved by converting the model to INT8 weights with fixed-point math, enabling it to operate without floating-point support on the highly constrained hardware.
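
The general idea of integer-only inference can be sketched as follows. This is not the developer's code: the Q8.8 activation format, the 16-bit scale precision, and all function names are assumptions chosen for illustration.

```python
from typing import List, Tuple

FRAC_BITS = 8    # activations stored as Q8.8 fixed-point (assumed format)
SCALE_BITS = 16  # weight scale stored with 16 fractional bits (assumed)


def quantize_int8(weights: List[float]) -> Tuple[List[int], int]:
    """Symmetric per-tensor quantization to INT8 plus a fixed-point scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, round(scale * (1 << SCALE_BITS))


def to_fixed(x: float) -> int:
    """Convert a float to Q8.8 (done offline, not on the device)."""
    return round(x * (1 << FRAC_BITS))


def from_fixed(x: int) -> float:
    """Convert Q8.8 back to float, only for inspecting results."""
    return x / (1 << FRAC_BITS)


def fixed_dot(q_weights: List[int], scale_fixed: int, x_fixed: List[int]) -> int:
    """Integer-only dot product; the result is in Q8.8."""
    acc = 0
    for w, x in zip(q_weights, x_fixed):
        acc += w * x  # int8 * Q8.8 accumulates in a wider integer
    # Apply the weight scale, then shift away its fractional bits.
    return (acc * scale_fixed) >> SCALE_BITS


weights = [0.5, -0.25, 0.125]
inputs = [1.0, 2.0, -4.0]
q_weights, scale_fixed = quantize_int8(weights)
result = from_fixed(fixed_dot(q_weights, scale_fixed, [to_fixed(v) for v in inputs]))
exact = sum(w * v for w, v in zip(weights, inputs))  # float reference: -0.5
```

Only multiplies, adds, and shifts appear in `fixed_dot`, which is what makes this style of arithmetic feasible on a CPU with no floating-point unit; the cost is a small rounding error relative to the float result.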

r/LocalLLaMA technical

Achieve 24+ tok/s with 30B MoE Models on GTX 1080

A demonstration reaches over 24 tokens/second with ~30B MoE models (Qwen 3.6 35B-A3B and Gemma 4 26B-A4B) on an older GTX 1080 GPU with 8 GB of VRAM at a 128k context. This was accomplished using llama.cpp's TurboQuant/RotorQuant KV cache quantization.

r/MachineLearning technical

NuExtract3 VLM Released for Document Information Extraction

Numind has released NuExtract3, an open-weight 4B Vision-Language Model (VLM) based on Qwen3.5-4B, available under an Apache-2.0 license. This self-hostable model is designed for practical information extraction from complex documents, including Markdown, OCR, PDFs, screenshots, forms, tables, receipts, and invoices.

GitHub technical

vllm-project/semantic-router: System Level Intelligent Router for Mixture-of-Mod

Semantic Router, from the vLLM project, is an open-source system-level intelligent router designed for managing and directing traffic to a mixture of AI models across cloud, data center, and edge environments. It aims to optimize model usage and deployment by intelligently routing requests based on semantic understanding. The project mentions support for various technologies including Hugging Face Candle and Kubernetes.

ArXiv deep-dive

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

MinT (MindLab Toolkit) is a managed infrastructure system designed for the post-training and online serving of millions of LLMs using Low-Rank Adaptation (LoRA). It aims to optimize scenarios where numerous fine-tuned policies are derived from a few expensive base models, avoiding the overhead of materializing each policy as a full merge.
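
Why skipping the merge is possible can be shown with a small worked example. This is a sketch of the standard LoRA identity, not MinT's code: the matrices, policy names, and helper functions are made up for illustration.

```python
from typing import Dict, List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(base_w: Matrix, lora_a: Matrix, lora_b: Matrix, x: Vector) -> Vector:
    """Compute y = W x + B (A x) without ever forming W + B A."""
    base_out = matvec(base_w, x)
    low_rank_out = matvec(lora_b, matvec(lora_a, x))
    return [p + q for p, q in zip(base_out, low_rank_out)]


def merged_forward(base_w: Matrix, lora_a: Matrix, lora_b: Matrix, x: Vector) -> Vector:
    """Reference path: materialize W' = W + B A (a full copy per policy)."""
    rank = len(lora_a)
    merged = [
        [base_w[i][j] + sum(lora_b[i][r] * lora_a[r][j] for r in range(rank))
         for j in range(len(base_w[0]))]
        for i in range(len(base_w))
    ]
    return matvec(merged, x)


# One shared base matrix, many small rank-1 adapters (hypothetical values).
BASE_W: Matrix = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
ADAPTERS: Dict[str, Dict[str, Matrix]] = {
    "policy_a": {"A": [[1, 0, 1]], "B": [[1], [0], [2]]},
    "policy_b": {"A": [[0, 2, 0]], "B": [[0], [1], [1]]},
}

x: Vector = [1, 2, 3]
outputs = {
    name: lora_forward(BASE_W, ad["A"], ad["B"], x)
    for name, ad in ADAPTERS.items()
}
```

Each policy stores only its small `A` and `B` factors, so the per-policy storage grows with the adapter rank rather than with the size of the base weights, while both paths give the same output.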

Simon Willison deep-dive

Datasette Agent Integrates AI for Data Exploration

Simon Willison announced the first release of Datasette Agent, a new extensible AI assistant for Datasette. This tool integrates the LLM Python library with Datasette, aiming to provide AI assistance for data exploration and analysis.

AI Engineer technical

Google DeepMind Shares Gemini API Conversational Agent Building

Thor Schaeff and Philipp Schmid from Google DeepMind present a session on building conversational agents using Gemini APIs. The presentation covers the new Interactions API, agent skills, server-side state management, and the Live API workflow for real-time audio and video streaming, including tool-using coding agents.

Google AI Blog technical

Google I/O 2026 Highlights Major AI Announcements

Google announced "100 things" at I/O 2026, including major releases like Gemini Omni, Google Antigravity, and Universal Cart. The article is a highlight reel covering the new products, features, and initiatives.

r/StableDiffusion deep-dive

Pixel-space AsymFLUX.2 klein Model Released with ComfyUI

The Pixel-space AsymFLUX.2 klein model has been released, along with its SFT variants and a dedicated ComfyUI extension and workflows. The release includes Hugging Face links for a demo and the models, as well as a GitHub repository for the ComfyUI integration, providing new tools for image generation.

Hacker News deep-dive

CODA Optimizes Transformer Computations with GEMM-Epilogue Programs

This arXiv paper introduces CODA, a method for rewriting Transformer blocks as GEMM-epilogue programs. The research aims to improve the efficiency of Transformer computations by optimizing their underlying matrix multiplication operations. It focuses on low-level performance enhancements for AI models.

r/MachineLearning technical

Dropbox Open-Sources Witchcraft for Local Semantic Search

Dropbox has open-sourced "Witchcraft," a re-implementation of Stanford's XTR-Warp semantic search engine in Rust, which provides fast local semantic search capabilities using a single-file SQLite database for storage.

Hacker News deep-dive

Research Introduces Multi-Stream LLMs for Enhanced Efficiency

A new research paper introduces the concept of "Multi-Stream LLMs," focusing on methods for parallelizing and separating prompts, internal thought processes, and input/output operations within large language models. This approach aims to enhance the efficiency and capability of LLMs in complex tasks.

Hacker News deep-dive

Forge Open-Source Guardrails Boost Agentic Task Reliability

Forge is an open-source reliability layer for self-hosted LLM tool-calling, developed by an AI Director at Texas Instruments. It claims to improve an 8B model's performance on agentic tasks from 53% to 99% by adding domain-and-tool-agnostic guardrails, including retry nudges, step enforcement, error recovery, and VRAM-aware context management.
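
One of the listed guardrails, the retry nudge, can be sketched generically. This is not Forge's API: the JSON call format, `call_tool_with_guardrails`, and `make_flaky_model` are hypothetical names invented for this illustration.

```python
import json
from typing import Callable, Dict, List

FORMAT_HINT = 'Reply with JSON only, e.g. {"tool": "name", "args": {}}'


def call_tool_with_guardrails(model: Callable[[List[str]], str],
                              tools: Dict[str, Callable[..., object]],
                              prompt: str,
                              max_retries: int = 3) -> object:
    """Ask the model for a tool call; on a malformed reply, nudge and retry."""
    messages = [prompt]
    for _ in range(max_retries + 1):
        raw = model(messages)
        try:
            call = json.loads(raw)
            name, args = call["tool"], call["args"]
            if name not in tools:
                raise KeyError(f"unknown tool {name!r}")
            return tools[name](**args)
        except (json.JSONDecodeError, KeyError, TypeError) as err:
            # Retry nudge: tell the model what went wrong and restate the format.
            messages.append(f"Your last reply was invalid ({err}). " + FORMAT_HINT)
    raise RuntimeError("model never produced a valid tool call")


def make_flaky_model(replies: List[str]) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for an LLM that returns canned replies in order."""
    it = iter(replies)
    return lambda messages: next(it)


tools = {"add": lambda a, b: a + b}
model = make_flaky_model([
    "Sure, I will add those numbers!",              # not JSON: triggers a nudge
    '{"tool": "add", "args": {"a": 2, "b": 3}}',    # valid on the second try
])
result = call_tool_with_guardrails(model, tools, "Add 2 and 3.")
```

The wrapper knows nothing about the domain or the specific tools, which is the sense in which such guardrails can be domain- and tool-agnostic.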

r/LocalLLaMA deep-dive

Fine-Tuned Cohere Transcribe Adds Diarization, Timestamps

A developer has fine-tuned Cohere Transcribe, an open-source speech-to-text model, to add support for diarization (speaker identification) and timestamps. The original model did not support either feature, even though the necessary tokens were already present in its tokenizer.

Hacker News accessible

OpenAI Adopts Google SynthID for AI Image Watermarking

OpenAI has announced its adoption of Google's SynthID watermarking technology for AI-generated images, alongside the release of a verification tool. SynthID embeds imperceptible digital watermarks into images created by OpenAI's models, and the verification tool lets users check whether an image carries one, improving content provenance.

r/LocalLLaMA technical

Open-Source Tool Generates Articulated 3D Objects

A developer has released an open-source tool on GitHub that generates 3D objects with functional, articulated parts, addressing a limitation where most text-to-3D pipelines produce monolithic blobs. The tool is demonstrated generating a 3D washing machine with rotating internal assembly and is described as mostly LLM-agnostic.
