
We Built a Hardware MCP Server: Auto-Detect Your GPU and Get the Right LLM

[Figure: AI processor chip on a circuit board. Caption: Runyard's hardware MCP server auto-detects your GPU and VRAM before recommending any model.]

@runyarddev/hw-mcp is a free, open-source MCP server (MIT, Node 18+) that auto-detects your GPU, VRAM, and RAM across CUDA, Metal, ROCm, and CPU backends, then ranks 900+ LLMs by hardware fit. Any MCP-compatible agent — Claude Desktop, Cursor, or a custom AI pipeline — gets instant, accurate hardware awareness with a single npx command.

Key Takeaways:

  • MCP SDK monthly downloads hit 97M by March 2026 (Digital Applied), up from 2M at launch.
  • @runyarddev/hw-mcp exposes 4 tools: detect_hardware, score_models, recommend_model, and parse_hardware_text.
  • The scoring engine covers 6 quantization levels and 8 use cases.
  • Install in Claude Desktop or Cursor with one JSON config block — no API key required.

What Is MCP and Why Does It Matter Now?

Anthropic launched the Model Context Protocol (MCP) on November 25, 2024. Monthly SDK downloads grew from 2M at launch to 97M by March 2026 (Digital Applied, March 2026) — a 48x increase in 16 months. MCP was donated to the Linux Foundation in December 2025, cementing it as a neutral, vendor-independent standard for agent tooling.

More than 5,800 MCP servers now exist (Digital Applied, March 2026), covering everything from database access to browser control. With 300+ MCP clients available (Official MCP Blog, November 2025), MCP has effectively become the USB-C of AI agent integration — one protocol, everything connects.

The AI agents market hit $5.43B in 2024 and is growing at a 45.82% CAGR (Master of Code, March 2026). Meanwhile, 79% of organizations have already adopted AI agents (PwC survey via Arcade.dev, 2025). Agents need tools. MCP is how those tools get wired up — and those tools increasingly need to understand the hardware they run on.

What Problem Does @runyarddev/hw-mcp Solve?

Local LLM adoption is accelerating fast. Ollama alone has 154,856 GitHub stars and 180% year-over-year growth (Vibe Data, November 2025). Yet every AI agent — whether Claude Desktop or a custom pipeline — is completely blind to the hardware it runs on. It cannot tell a 24GB RTX 4090 from a 6GB laptop GPU.

The result is wasted time. You ask an agent which model to run, it suggests Llama 3.1 70B, and you spend 20 minutes downloading 40GB only to watch it crash your machine. Or it suggests a tiny 3B model when your GPU could comfortably run a 32B. Both outcomes are frustrating — and both are avoidable.

Running local LLMs costs $1,500 one-time versus $24,000 per year for a 10-developer team using cloud APIs (Vibe Data, November 2025). That ROI only materializes if you're running the right models at the right quantization. @runyarddev/hw-mcp gives any MCP-compatible agent the hardware context it needs to make smart recommendations, automatically.

Most model recommendation tools require you to manually enter GPU specs, look up VRAM figures, and cross-reference a spreadsheet. @runyarddev/hw-mcp eliminates all three steps by detecting specs programmatically and cross-referencing a curated database of 187 GPUs. No forms, no config files, no manual inputs.

The 4 MCP Tools: A Complete Walkthrough

Tool 1: detect_hardware

detect_hardware takes no inputs and returns a full HWSpecs object. It runs a detection chain: nvidia-smi first (all platforms), then wmic on Windows, system_profiler on macOS, and glxinfo plus lspci on Linux. If those commands fail or return incomplete data, it cross-references a built-in database of 187 GPUs to fill in the blanks.

The returned object includes: cpu name, core count, total RAM in GB, GPU model, VRAM in GB, and the detected backend. Supported backends are CUDA, Metal, ROCm, SYCL, CPU ARM, and CPU x86. The response also includes a method string that explains which detection path succeeded, plus a warnings array for any partial detections.

Apple Silicon gets special handling. Unified memory means the GPU and CPU share the same pool, so VRAM is set equal to total RAM. An M3 Max with 64GB reports 64GB of effective VRAM — which is accurate, since llama.cpp and Ollama both use unified memory this way on Metal builds.
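The returned shape can be sketched as a TypeScript interface. Field names below are assumptions inferred from the description above, not the package's published types:

```typescript
// Illustrative HWSpecs shape — field names are assumptions, not the
// package's actual type definitions.
type Backend = "CUDA" | "Metal" | "ROCm" | "SYCL" | "CPU ARM" | "CPU x86";

interface HWSpecs {
  cpu: string;      // CPU name
  cores: number;    // core count
  ram: number;      // total system RAM in GB
  gpu: string;      // GPU model
  vram: number;     // VRAM in GB
  backend: Backend; // detected compute backend
}

// Apple Silicon unified memory: effective VRAM equals total RAM.
function applyUnifiedMemory(hw: HWSpecs): HWSpecs {
  return hw.backend === "Metal" ? { ...hw, vram: hw.ram } : hw;
}

const m3Max: HWSpecs = {
  cpu: "Apple M3 Max", cores: 16, ram: 64,
  gpu: "Apple M3 Max", vram: 0, backend: "Metal",
};
console.log(applyUnifiedMemory(m3Max).vram); // 64
```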

Tool 2: score_models

score_models accepts an HWSpecs object, plus optional parameters: use_case, fit_filter, and limit (default 20, max 100). It ranks the full Runyard catalog of 900+ LLMs and returns each model's composite score out of 100, the quantization used, estimated tokens per second, memory footprint in GB, and a fit level.

Fit levels have four tiers: Perfect, Good, Marginal, and TooTight. Run modes reflect how the model will execute: GPU (fits fully in VRAM), MoE (Mixture-of-Experts, only active expert parameters loaded), CPU+GPU (split across VRAM and RAM), CPU (full system RAM), and Too Tight (won't fit at any quantization).
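The run-mode decision described above can be sketched as a small classifier. The thresholds here are a simplification for illustration — the real engine also accounts for KV-cache headroom and per-quantization footprints:

```typescript
// Run-mode classification sketch based on the tiers described above.
type RunMode = "GPU" | "CPU+GPU" | "CPU" | "TooTight";

function runMode(memGB: number, vramGB: number, ramGB: number): RunMode {
  if (vramGB > 0 && memGB <= vramGB) return "GPU";             // fits fully in VRAM
  if (vramGB > 0 && memGB <= vramGB + ramGB) return "CPU+GPU"; // split across VRAM and RAM
  if (memGB <= ramGB) return "CPU";                            // full system RAM
  return "TooTight";                                           // won't fit
}

console.log(runMode(9.1, 12, 32)); // "GPU" — e.g. a 14B model at Q4_K on 12GB VRAM
```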

Tool 3: recommend_model

recommend_model requires hw and use_case, with an optional priority: balanced, quality, speed, or memory. It returns a single best-fit model, a plain-English explanation of why it was chosen, a ready-to-run ollama pull command, and three alternative models for comparison. This is the tool to call when an agent needs one definitive answer.

Priority re-weights the composite scoring formula. Quality mode raises qualScore weighting to 70%. Speed mode raises speedScore to 60%. Memory mode raises fitScore to 60%. Balanced mode distributes weights evenly across all four scoring components.
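Those presets can be sketched as weight tables. The dominant weights (70/60/60 percent) come from the description above; how the remainder is split across the other components is an illustrative assumption:

```typescript
// Priority weight presets — dominant weights from the text; the split of
// the remaining weight across other components is an assumption.
type Priority = "balanced" | "quality" | "speed" | "memory";
interface Weights { wq: number; ws: number; wf: number; wc: number }

function weightsFor(p: Priority): Weights {
  switch (p) {
    case "quality": return { wq: 0.70, ws: 0.10, wf: 0.10, wc: 0.10 };
    case "speed":   return { wq: 0.20, ws: 0.60, wf: 0.10, wc: 0.10 };
    case "memory":  return { wq: 0.20, ws: 0.10, wf: 0.60, wc: 0.10 };
    default:        return { wq: 0.25, ws: 0.25, wf: 0.25, wc: 0.25 };
  }
}
```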

Tool 4: parse_hardware_text

parse_hardware_text accepts a raw text string and parses it into an HWSpecs object. It understands the output formats of nvidia-smi, wmic, /proc/cpuinfo, and system_profiler. This is useful when your agent has already captured terminal output from a prior shell command and needs to convert that text into structured hardware data without running duplicate system commands.
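As a sketch of what that parsing involves, here is a minimal parser for one of the supported formats — nvidia-smi's CSV query output. This is illustrative only; the real tool handles several formats and many edge cases:

```typescript
// Minimal parser for nvidia-smi's --query-gpu=name,memory.total CSV output.
// Illustrative only — parse_hardware_text also understands wmic,
// /proc/cpuinfo, and system_profiler formats.
function parseNvidiaSmiCsv(text: string): { gpu: string; vram: number } | null {
  // Example line: "NVIDIA GeForce RTX 4070, 12282 MiB"
  const m = text.trim().match(/^(.+?),\s*(\d+)\s*MiB$/m);
  if (!m) return null;
  return { gpu: m[1].trim(), vram: Math.round(Number(m[2]) / 1024) };
}

console.log(parseNvidiaSmiCsv("NVIDIA GeForce RTX 4070, 12282 MiB"));
// { gpu: "NVIDIA GeForce RTX 4070", vram: 12 }
```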

Hardware Detection: How the Fallback Chain Works

Building reliable cross-platform hardware detection is harder than it looks. nvidia-smi output formats differ between driver versions. macOS system_profiler changed its JSON schema in Sonoma. Linux lspci requires parsing unstructured text. The detection chain handles all of this with a graceful fallback sequence — each step is tried in order, and the first successful result wins.

detect-chain.ts (illustrative)
// Simplified detection chain — illustrates the priority order.
// runCommand and the parse* helpers are elided for brevity.
const platform = process.platform;

async function detectHardware(): Promise<{ hw: HWSpecs; method: string; warnings: string[] }> {
  const warnings: string[] = [];

  // Step 1: Try nvidia-smi (works on Windows, Linux, WSL2)
  const nvidiaSmi = await runCommand(
    'nvidia-smi --query-gpu=name,memory.total --format=csv,noheader'
  );
  if (nvidiaSmi.ok) {
    const hw = parseNvidiaSmi(nvidiaSmi.stdout);
    return { hw, method: 'nvidia-smi', warnings };
  }

  // Step 2: Platform-specific fallback
  if (platform === 'win32') {
    const wmic = await runCommand(
      'wmic path win32_VideoController get name,AdapterRAM'
    );
    if (wmic.ok) return { hw: parseWmic(wmic.stdout), method: 'wmic', warnings };
  }

  if (platform === 'darwin') {
    const sp = await runCommand('system_profiler SPDisplaysDataType -json');
    if (sp.ok) {
      // Apple Silicon: unified memory sets vram = total RAM
      return { hw: parseSystemProfiler(sp.stdout), method: 'system_profiler', warnings };
    }
  }

  if (platform === 'linux') {
    const glx = await runCommand('glxinfo | grep "OpenGL renderer"');
    if (glx.ok) {
      const hw = parseGlxinfo(glx.stdout);
      // Cross-reference GPU database of 187 GPUs to fill missing VRAM
      hw.vram = GPU_DATABASE[hw.gpu] ?? hw.vram;
      return { hw, method: 'glxinfo+db', warnings };
    }
    const lspci = await runCommand('lspci | grep -i vga');
    if (lspci.ok) return { hw: parseLspci(lspci.stdout), method: 'lspci', warnings };
  }

  // Step 3: CPU-only fallback
  warnings.push('No GPU detected — scoring will use CPU inference estimates');
  return { hw: await detectCpuOnly(), method: 'cpu-fallback', warnings };
}

The GPU database cross-reference is critical for ROCm users. AMD cards frequently report VRAM as 0 or N/A through generic interfaces. The 187-entry database maps GPU model names to known VRAM capacities, letting the scoring engine proceed accurately even when direct memory queries fail.

How the Scoring Algorithm Works

The scoring engine in score.ts uses six quantization levels and eight use cases to produce a composite score for every model in the catalog. Quantizations are: Q8_0 at 8.5 bits per weight, Q6_K at 6.6, Q5_K at 5.7, Q4_K at 4.8, Q3_K at 3.9, and Q2_K at 2.6. The engine always selects the highest quantization that fits within available VRAM.

VRAM requirement is calculated as: parameters x bpw / 8. For Mixture-of-Experts models, only the active expert parameter count is used — not the total. This matters for models like Mixtral 8x7B, which has 46.7B total parameters but only 12.9B active at once, making it fit on hardware that a 46B dense model never could.
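A quick worked example of that formula — multiplying parameter count by bits per weight and dividing by 8 gives bytes, so dividing again by 1e9 yields GB:

```typescript
// VRAM requirement: parameters × bits-per-weight / 8 bytes, shown in GB.
function vramRequiredGB(params: number, bpw: number): number {
  return (params * bpw) / 8 / 1e9;
}

// Dense 14B at Q4_K (4.8 bpw): ≈ 8.4 GB
console.log(vramRequiredGB(14e9, 4.8));
// Mixtral 8x7B scored on 12.9B active params, not 46.7B total: ≈ 7.7 GB
console.log(vramRequiredGB(12.9e9, 4.8));
// The same model scored on total params would need ≈ 28 GB
console.log(vramRequiredGB(46.7e9, 4.8));
```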

Backend speed constants set the baseline tok/s: CUDA at 220, ROCm at 180, Metal at 160, SYCL at 100, CPU ARM at 90, and CPU x86 at 70. These multiply against model-specific efficiency factors derived from parameter count and quantization level to produce per-model tok/s estimates.

The composite score formula is: qualScore x wq + speedScore x ws + fitScore x wf + ctxScore x wc. The four weights shift depending on the selected use case. Reasoning raises qualScore weight. Chat raises speed weight. Embedding emphasizes fitScore and ctxScore. The eight supported use cases are: General, Coding, Reasoning, Chat, Multimodal, Embedding, ImageGen, and VideoGen.
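Putting the pieces together, a sketch of quantization selection plus the composite formula. The bpw values match the six levels listed above; component scores and weights are placeholder inputs, and this naive check ignores KV-cache headroom, so the real engine may choose a lower level than shown:

```typescript
// Quantization selection + composite score sketch.
const QUANTS: Array<[name: string, bpw: number]> = [
  ["Q8_0", 8.5], ["Q6_K", 6.6], ["Q5_K", 5.7],
  ["Q4_K", 4.8], ["Q3_K", 3.9], ["Q2_K", 2.6],
];

// Highest quantization whose footprint fits in available VRAM.
function bestQuant(params: number, vramGB: number): string | null {
  for (const [name, bpw] of QUANTS) {
    if ((params * bpw) / 8 / 1e9 <= vramGB) return name;
  }
  return null; // won't fit at any quantization
}

// Composite: qualScore·wq + speedScore·ws + fitScore·wf + ctxScore·wc
function composite(
  s: { qual: number; speed: number; fit: number; ctx: number },
  w: { wq: number; ws: number; wf: number; wc: number },
): number {
  return s.qual * w.wq + s.speed * w.ws + s.fit * w.wf + s.ctx * w.wc;
}

console.log(bestQuant(14e9, 12)); // "Q6_K" — 14B × 6.6 / 8 ≈ 11.6 GB fits in 12 GB
```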

MCP SDK Monthly Downloads — Growth Since Launch

  • Nov 2024 — 2M downloads/month
  • Jan 2025 — 8M downloads/month
  • Apr 2025 — 22M downloads/month
  • Jul 2025 — 45M downloads/month
  • Nov 2025 — 68M downloads/month
  • Mar 2026 — 97M downloads/month

How to Install in Claude Desktop

Installation requires no npm install step, no build, and no API key. Add a single entry to your claude_desktop_config.json and restart Claude Desktop. The server communicates over stdio using StdioServerTransport from @modelcontextprotocol/sdk v1.10.1. Claude spawns the server automatically via npx on first use.

Find your Claude Desktop config file. On macOS it lives at ~/Library/Application Support/Claude/claude_desktop_config.json. On Windows it's at %APPDATA%\Claude\claude_desktop_config.json. Open it and add the block below, then restart Claude Desktop.

claude_desktop_config.json
{
  "mcpServers": {
    "runyard": {
      "command": "npx",
      "args": ["-y", "@runyarddev/hw-mcp"]
    }
  }
}

Once installed, ask Claude: "What's the best local model for coding on my hardware?" Claude calls detect_hardware automatically, then score_models with the Coding use case, and returns a ranked list with ollama pull commands. No manual GPU lookup needed.

How to Install in Cursor

Cursor reads MCP server config from a .cursor/mcp.json file in your project root, or from a global MCP config in your home directory. The format is identical to Claude Desktop. Add the same block, save the file, and restart Cursor. The server becomes available to Cursor's agent mode immediately.

.cursor/mcp.json
{
  "mcpServers": {
    "runyard": {
      "command": "npx",
      "args": ["-y", "@runyarddev/hw-mcp"]
    }
  }
}

If npx is blocked on your network, install globally first with npm install -g @runyarddev/hw-mcp, then change the command to "runyard-hw-mcp" and remove the args array. The server exposes the same 4 tools either way.
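With the global install just described, the config block becomes:

```json
{
  "mcpServers": {
    "runyard": {
      "command": "runyard-hw-mcp"
    }
  }
}
```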

What Does a Real Agent Conversation Look Like?

Here is a realistic example of an agent using all four tools in sequence — detecting hardware, scoring models for a specific use case, and returning one recommendation with a ready-to-run Ollama command.

agent-conversation.txt
User: What's the best local coding model for my machine?

Agent calls: detect_hardware()
Returns: { gpu: "RTX 4070", vram: 12, ram: 32, backend: "CUDA", method: "nvidia-smi" }

Agent calls: recommend_model({ hw: ..., use_case: "Coding", priority: "balanced" })
Returns: {
  model: "qwen2.5-coder:14b",
  score: 87,
  quant: "Q4_K",
  tokPerSec: ~88,
  memFootprint: "9.1 GB",
  fit: "Perfect",
  explanation: "Qwen2.5 Coder 14B at Q4_K fits comfortably in your 12GB VRAM
    with 2.9GB headroom. Best coding benchmark result achievable on your GPU.
    Faster alternatives exist at 7B but score ~11 points lower on multi-file tasks.",
  pullCommand: "ollama pull qwen2.5-coder:14b",
  alternatives: ["codestral:7b", "deepseek-coder:6.7b", "qwen2.5-coder:7b"]
}

Agent reply: "Your RTX 4070 (12GB VRAM) is a strong coding GPU. I recommend
Qwen2.5 Coder 14B — run 'ollama pull qwen2.5-coder:14b' to get started."

[Figure: circuit board close-up. Caption: Runyard reads your GPU specs directly from system commands — no manual config required.]

Supported Backends and Expected Performance

The scoring engine sets baseline tok/s constants per backend. These scale down proportionally for larger models and lower-end hardware. Use them as relative benchmarks rather than guarantees — actual performance depends on context length, CPU speed, and thermal throttling.

  • CUDA (Nvidia) — 220 tok/s baseline. Best-supported backend. All quantizations stable.
  • ROCm (AMD) — 180 tok/s baseline. Support improving rapidly; some models need ROCm-specific builds.
  • Metal (Apple Silicon) — 160 tok/s baseline. Unified memory is a major advantage for larger models.
  • SYCL (Intel Arc / oneAPI) — 100 tok/s baseline. Still maturing; Q4 and Q8 most stable.
  • CPU ARM (Apple M-series, Raspberry Pi) — 90 tok/s baseline. Slow but usable at 7B Q4.
  • CPU x86 (AMD/Intel without discrete GPU) — 70 tok/s baseline. Practical for small models only.

Tips for Getting the Most Out of the Server

Call detect_hardware once and cache the result. HWSpecs does not change between prompts, so you do not need to re-detect on every tool call. Pass the cached object directly to score_models or recommend_model to reduce latency.
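A minimal memoization wrapper illustrates the caching pattern. The detect function here is a stand-in for a call through your MCP client to detect_hardware:

```typescript
// Cache the first detection result and reuse it for every later call.
function memoize<T>(fn: () => T): () => T {
  let box: { value: T } | null = null;
  return () => {
    if (!box) box = { value: fn() }; // first call runs detection
    return box.value;                // later calls reuse the cached specs
  };
}

let detections = 0;
// Stand-in for a real detect_hardware call via your MCP client.
const getHardware = memoize(() => {
  detections++;
  return { gpu: "RTX 4070", vram: 12, ram: 32, backend: "CUDA" };
});

getHardware();
getHardware();
console.log(detections); // 1 — detection ran only once
```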

Use parse_hardware_text if your agent already has terminal output from a previous shell command. This avoids duplicate system commands and works well in sandboxed environments where the MCP server process has limited OS access.

The fit_filter parameter on score_models accepts "Perfect", "Good", "Marginal", or "TooTight". Filter to Perfect and Good for conservative recommendations. Include Marginal only when the user explicitly wants the largest possible model, even if it leaves little VRAM headroom.

On multi-GPU systems, nvidia-smi reports each GPU separately. The server sums VRAM across all detected GPUs on the same host. If your inference runtime uses tensor parallelism across all GPUs, this is correct. If each GPU runs independently, set the hw object manually with the VRAM of the single GPU your runtime uses.
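The summing behavior can be sketched from nvidia-smi's per-GPU CSV output, which prints one line per GPU:

```typescript
// Sum VRAM across all GPUs reported by nvidia-smi (one CSV line per GPU).
// Override the hw object manually if each GPU serves an independent runtime.
function totalVramGB(nvidiaSmiCsv: string): number {
  return nvidiaSmiCsv
    .trim()
    .split("\n")
    .map((line) => Number(line.split(",")[1].replace(/MiB/, "").trim()) / 1024)
    .reduce((sum, gb) => sum + gb, 0);
}

const twoGpus = "NVIDIA RTX 4090, 24564 MiB\nNVIDIA RTX 4090, 24564 MiB";
console.log(Math.round(totalVramGB(twoGpus))); // 48
```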

What's Next for @runyarddev/hw-mcp?

Version 1.0.0 covers the core hardware detection and model scoring pipeline. Planned additions include: querying running Ollama instances to check which models are already downloaded, direct integration with the Runyard catalog for real-time model updates, and a list_backends tool that surfaces all detected compute backends in a single call.

The local LLM ecosystem now tracks 695 models across 110+ GitHub repos (Vibe Data, November 2025). That breadth makes hardware-aware filtering more valuable over time. As new models ship, the Runyard catalog updates — and @runyarddev/hw-mcp picks them up automatically on the next detect cycle.

Start Using @runyarddev/hw-mcp Today

The package is free, MIT-licensed, and requires only Node 18 or higher. Add the config block to Claude Desktop or Cursor, and every future conversation about local models will be grounded in your actual hardware — no spreadsheets, no manual spec lookups, no wasted downloads.

Visit runyard.dev to explore the full model catalog, use the hardware-aware Model Radar, and see which models your GPU can run today. The same scoring engine that powers @runyarddev/hw-mcp powers every recommendation on the site.
