Using Coding Agents to Design: My Real-World Test of AI Models - Shreyas Rao

Hey there! So I’ve been experimenting with AI coding agents for design work lately, and I wanted to share what I actually found when I put different models to the test. No hype, just real results from a real experiment.

Part 1: OpenCode — The Terminal-Based Coding Agent

What Is OpenCode?

OpenCode is an open-source AI coding agent that lives in your terminal, IDE, or desktop. Think of it as an AI pair programmer that you can summon with a keyboard shortcut.

At its core, OpenCode is a Go-based application with a rich Terminal User Interface (TUI). It lets you interact with AI models to write code, debug issues, refactor, and even design UI components — all from the command line.

Why Terminal-Based Agents Matter

You might wonder: why use a terminal agent for design work? Here’s what I’ve found:

Privacy-first: OpenCode itself doesn’t store your code or context data on remote servers. With a local model, everything stays on your machine; with a cloud model, your prompts still go to that provider.
Speed: No GUI overhead means faster interactions. opencode run "Build a dashboard" is faster than opening an IDE.
Multi-session: Start multiple agents in parallel on the same project.
Shareable sessions: Create public links to any session for collaboration or debugging.
IDE agnostic: Works in VS Code, Cursor, terminal, and as a desktop app.

Key Features for Design Work

LSP enabled: Automatically loads the right language servers for intelligent code completion
Multi-session agents: Run multiple agents in parallel — one for design, one for backend
GitHub integration: Authenticate with GitHub to use Copilot tokens
Any model support: Connect Claude, GPT, Gemini, Qwen, Kimi, or any model from 75+ providers
MCP server support: Extend with custom tools via the Model Context Protocol
Agent skills: Create and manage custom agents with tailored system prompts

The OpenCode Ecosystem

OpenCode isn’t just a CLI tool. It’s an ecosystem:

TUI (Terminal UI): Full interactive terminal experience
Web IDE: Access from any browser
Desktop app: Beta available on macOS, Windows, and Linux
IDE extension: VS Code, Cursor, and any terminal-supporting editor
Agent framework: Create custom agents with /agent create

Part 2: Pencil — Design as Code with MCP

What Is Pencil?

Pencil is a design tool that flips the traditional design workflow on its head. Instead of designing in a separate tool and then handing off to developers, Pencil lets you design inside your code editor and land directly in code.

The core philosophy: Design on canvas. Land in code.

The .pen Format and Design-as-Code

Pencil uses a .pen file format — a code-native representation of design layouts. This means:

Design files live in your project repository
Version control your designs alongside your code
AI can read, modify, and generate designs programmatically
No more “where’s the latest Figma?” confusion

MCP Integration — The AI Connection

Pencil’s killer feature is its deep Model Context Protocol (MCP) integration. The Pencil MCP server runs locally and gives AI assistants the ability to:

Read your .pen files
Modify existing designs
Generate new design components from natural language
Apply design tokens and style guides
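To make the "AI connection" a bit more concrete: MCP is built on JSON-RPC 2.0, and a client invokes a server tool with a `tools/call` request. Here is a minimal sketch of what such a request looks like on the wire. The tool name `read_design` and its `path` argument are hypothetical placeholders I made up for illustration; I have not checked what tools the Pencil MCP server actually exposes.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# ASSUMPTION: "read_design" and "path" are illustrative names only,
# not the real Pencil MCP tool schema.
request = make_tool_call(1, "read_design", {"path": "./designs/chatbot.pen"})
print(json.dumps(request, indent=2))
```

In practice the coding agent builds and sends these messages for you; you only ever type the natural-language request.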

Supported AI Assistants:

Claude Code (CLI and IDE)
Cursor IDE
Windsurf IDE (Codeium)
Codex CLI (OpenAI)
Antigravity IDE
OpenCode CLI

Example Workflow

# 1. Start Pencil and open a .pen file
pencil design.pen

# 2. In Claude Code, ask for design changes
claude
# "Create a login form with email and password"
# "Add a navigation bar to this page"
# "Design a card component for my design system"

# 3. AI uses MCP tools to modify the .pen file
# 4. Changes appear on the canvas immediately

Pencil vs Traditional Design Tools

Feature	Pencil	Figma/Sketch	Traditional Handoff
Live in IDE	Yes	No	N/A
AI-modifiable	Yes (via MCP)	Limited	No
Version controlled	Yes (git)	Yes (cloud)	No
Design → Code	Direct	Plugin needed	Manual
Privacy	Local-only	Cloud-hosted	Local

Pencil is free to use for now, so I went ahead with it.

Part 3: The Experiment — Testing 6 Models on the Same Design Task

Alright, here’s where it gets interesting. I wanted to see how different AI models actually perform when given the exact same design task. No cherry-picking, no special prompting — just raw capability comparison.

The Setup

I ran all models through OpenCode with the Pencil MCP and gave them this prompt:

Use pencil MCP, in the "./designs/chatbot-[model].pen" file, add the following features:
- Create a header section with company name "Fin Doc Analyzer" on the left and login/logout buttons on the right
- Create a chat history section on the left which shows past chats
- Create a document viewer section in the center which shows the uploaded financial pdf document
- Create a chat bot section in the right which allows the user to ask questions about the uploaded document.
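Since only the file name changes between runs, the prompt is easy to template. Below is a rough sketch of how the per-model prompts could be generated. The slugs in `MODEL_SLUGS` are hypothetical file-name labels, not real model identifiers, and the command list only mirrors the `opencode run "<prompt>"` form mentioned earlier; how you select the model for each run is left out because it depends on your OpenCode configuration.

```python
PROMPT_TEMPLATE = """Use pencil MCP, in the "./designs/chatbot-{slug}.pen" file, add the following features:
- Create a header section with company name "Fin Doc Analyzer" on the left and login/logout buttons on the right
- Create a chat history section on the left which shows past chats
- Create a document viewer section in the center which shows the uploaded financial pdf document
- Create a chat bot section in the right which allows the user to ask questions about the uploaded document."""

# ASSUMPTION: these slugs are made-up labels used only in the .pen file names.
MODEL_SLUGS = ["qwen3.5-27b", "gemma4-31b", "kimi-k2.5", "claude-sonnet-4.6"]


def build_prompt(slug: str) -> str:
    """Fill the model slug into the shared prompt."""
    return PROMPT_TEMPLATE.format(slug=slug)


def build_command(slug: str) -> list[str]:
    """Return an argv list for `opencode run`; nothing is executed here."""
    return ["opencode", "run", build_prompt(slug)]


for slug in MODEL_SLUGS:
    # Show only the first prompt line to confirm the target file per model.
    print(build_prompt(slug).splitlines()[0])
```

Keeping the prompt in one template is what makes the comparison fair: every model sees the same wording, character for character, apart from its own file name.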

The Models Tested

Local Models (Running on RTX 3090 via llama.cpp)

Model	Parameters
Qwen 3.5	27B (Dense)
Qwen 3.6	35B (MoE)
Gemma 4	26B (MoE)
Gemma 4	31B (Dense)

Cloud Models

Model	Provider
Kimi K2.5	Moonshot AI
Claude Sonnet 4.6	Anthropic
Kimi K2.6	Moonshot AI

Part 4: The Results — And Honestly? The Local Models Struggled

Let me show you what each model actually produced, because the difference between cloud and local was pretty stark.

Cloud Models

Claude Sonnet 4.6 — The Clear Winner

What I got:

Professional dark navy header with a branded icon (not just text)
Detailed chat history sidebar with 5 different chat items, timestamps, and message counts
Document viewer with an actual PDF preview showing financial content
Chat interface with proper message bubbles, user avatar, and input field
Consistent spacing, shadows, and visual hierarchy throughout

This felt production-ready. The attention to detail was impressive — things like the “+ New” button in the sidebar, the user avatar with initials, the subtle shadows on the PDF container. It just looked polished.

Kimi K2.5 — Solid and Professional

What I got:

Clean white header with a document icon next to the company name
Chat history with icons for each item and a “+ New Chat” button
Document viewer with upload button and a detailed PDF preview
Chat section with message bubbles

Honestly, this was nearly as good as Claude. The design was clean, modern, and functional. The use of icons in the chat history was a nice touch that Claude didn’t even have.

Kimi K2.6 — Surprisingly Disappointing

Okay, I expected K2.6 to be better than K2.5, but… it wasn’t. At all.

What I got:

Dark blue header (decent)
Basic chat history with minimal styling
Document viewer that was just a placeholder with an emoji — no actual PDF content
Chat section that looked okay but nothing special

The whole thing felt like a step backward. Less detail, less polish, less everything. The document viewer was especially weak — just a gray box with an emoji instead of actual content. I was genuinely surprised given how good K2.5 was.

Local Models

Now here’s where things got rough. I wanted local models to work well — who doesn’t want private, free AI design? — but the results were pretty disappointing across the board.

Qwen 3.5 (27B) — The Best of a Bad Bunch

Surprisingly, the dense Qwen model actually performed the best among local models.

What I got:

Blue header with the company name
Chat history sidebar with actual items and timestamps
Document viewer with an upload button (with icon!) and a proper placeholder
Chat section with welcome message, user message, and bot response
Even had things like the send button with an icon

Don’t get me wrong — this wasn’t as polished as the cloud models. The spacing was off, the colors were basic, and it lacked the refined details. But structurally? It was complete. All four sections were there and functional.

Qwen 3.6 (35B MoE)

This one was confusing: a bigger MoE model, yet worse output than the dense Qwen 3.5.

What I got:

White header with basic login/logout buttons
Chat history with a “+ New Chat” button (nice touch)
Document viewer with actual PDF content (impressive!)
Chat section that was… incomplete

The main elements were there, but the spacing was way off.

Gemma 4 (31B) — Slightly Better, Still Weak

The bigger Gemma was better than the 26B MoE version, but still not great.

What I got:

Header with actual button frames for login/logout
Chat history in a card-style container with items
Document viewer that was just a labeled rectangle — still no real content
Chatbot section with messages and an input field

It had less detail, fewer features, and just felt less complete. The document viewer was especially disappointing — just an empty box labeled “Document Viewer.”

The local models were not as good as the cloud models. I suspect part of the reason is that the verification step (via the Pencil screenshot) did not work well for them. I will need to dig into the session transcripts to confirm this.

Part 5: Design Evaluation — Let’s Get Real About the Scores

I needed a way to objectively score these, so I created an evaluation framework based on what actually matters for a UI design:

Evaluation Framework

Criteria	Weight	What I Looked For
Layout Structure	25%	Are all 4 required sections present and correctly arranged?
Visual Hierarchy	20%	Is it clear what’s important? Good information architecture?
Color & Typography	20%	Consistent colors? Readable text? Professional look?
Component Detail	20%	Are buttons actual buttons? Icons present? Real content?
Spacing & Alignment	15%	Proper padding, margins, things lining up correctly

The Scores (And My Honest Assessment)

Model	Layout (25)	Visual (20)	Color/Typo (20)	Detail (20)	Spacing (15)	Total
Claude Sonnet 4.6	25	19	19	19	14	96/100
Kimi K2.5	25	18	18	18	13	92/100
Kimi K2.6	23	15	15	12	11	76/100
Qwen 3.5 (27B)	24	14	14	14	10	76/100
Qwen 3.6 (35B MoE)	20*	15	15	15	11	76/100
Gemma 4 (31B)	22	13	13	11	9	68/100
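If you want to check the arithmetic or re-score with your own numbers, here is a small sketch that recomputes the totals from the table. The per-criterion points are copied from the table; the dictionary names and the range check are my own scaffolding, not part of any tool.

```python
# Maximum points per criterion, matching the weights in the framework.
CRITERIA = {"Layout": 25, "Visual": 20, "Color/Typo": 20, "Detail": 20, "Spacing": 15}

# Points per model, in the same column order as CRITERIA.
SCORES = {
    "Claude Sonnet 4.6": [25, 19, 19, 19, 14],
    "Kimi K2.5": [25, 18, 18, 18, 13],
    "Kimi K2.6": [23, 15, 15, 12, 11],
    "Qwen 3.5 (27B)": [24, 14, 14, 14, 10],
    "Qwen 3.6 (35B MoE)": [20, 15, 15, 15, 11],
    "Gemma 4 (31B)": [22, 13, 13, 11, 9],
}


def total(points: list[int]) -> int:
    """Sum the points after checking each one fits its criterion's maximum."""
    for got, cap in zip(points, CRITERIA.values()):
        if not 0 <= got <= cap:
            raise ValueError(f"score {got} is outside 0..{cap}")
    return sum(points)


totals = {model: total(points) for model, points in SCORES.items()}
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for model, score in ranking:
    print(f"{model:<20} {score}/100")
```

Three models tie at 76. Python's sort is stable, so tied entries keep their table order rather than being ranked against each other.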

What These Scores Mean

Claude Sonnet 4.6 is in a league of its own. It’s the only one that felt truly production-ready.

Kimi K2.5 is excellent value — nearly Claude-level quality for a fraction of the cost.

Kimi K2.6 was genuinely disappointing. I expected it to beat K2.5, but it was noticeably worse.

Qwen 3.5 is the local model winner, but that’s a low bar. It’s “okay” — not great.

Gemma 4 (both sizes) just wasn’t competitive.

Part 6: My Recommendations After All This Testing

For Professional Design Work

Use Claude Sonnet 4.6. Yes, it’s more expensive, but the quality difference is real. When you’re building something for production, the polish matters.

Kimi K2.5 as a budget alternative. If you need to cut costs, K2.5 gets you most of Claude’s quality (92 vs. 96 in my scoring) at about 25% of the price. Just skip K2.6; it’s worse for some reason.

For Privacy-Conscious Work

Qwen 3.5 (27B) is your best bet. It’s not amazing, but it’s the most complete of the local options. You get all four sections, functional components, and a design that won’t embarrass you.

Don’t bother with Gemma 4. Both sizes produced underwhelming results. The 26B was too minimal, and the 31B wasn’t much better.

The Hybrid Workflow I’m Actually Using

After all this testing, here’s what I’m doing in practice:

Avoid → Kimi K2.6, Gemma 4 (any size), Qwen 3.6 (incomplete output)
Exploration & Drafting → Kimi K2.5 (cheap, fast, good enough)
Production Polish → Claude Sonnet 4.6 (when it needs to look professional)
Privacy-Sensitive Projects → Qwen 3.5 local (accept the quality trade-off)
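If you script your agent runs, the same workflow fits in a small lookup table. This is just a sketch of the list above: the task keys (`exploration`, `production`, `privacy_sensitive`) are labels I invented here, and the values are plain display names rather than provider model IDs.

```python
# Task type -> model, taken from the workflow list above.
# ASSUMPTION: the task keys are made-up labels for illustration.
WORKFLOW = {
    "exploration": "Kimi K2.5",
    "production": "Claude Sonnet 4.6",
    "privacy_sensitive": "Qwen 3.5 (27B, local)",
}

# Models I am skipping for design work for now.
AVOID = {"Kimi K2.6", "Gemma 4", "Qwen 3.6"}


def pick_model(task: str) -> str:
    """Return the model for a task type, failing loudly on unknown types."""
    try:
        return WORKFLOW[task]
    except KeyError:
        raise ValueError(f"unknown task type: {task!r}") from None


print(pick_model("exploration"))
```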

Part 7: The Bottom Line

So what did I learn from all this?

Cloud models are still significantly better for design work. The gap isn’t closing as fast as I’d hoped. Local models are usable for drafts and personal projects, but they’re not ready for professional design work yet.

The “good enough” threshold is real. For internal tools or quick prototypes, Qwen 3.5 local is genuinely fine. For customer-facing products, it’s worth paying for Claude or Kimi K2.5.

The tools are here, they’re usable, and they’re getting better. Part 2 is incoming.

Tested in April 2026 on an RTX 3090 with llama.cpp for local models. All designs generated using the same prompt via OpenCode with Pencil MCP.

Design files, screenshots and session transcripts can be found here