Multimodal Prototyping: Beyond the Screen (Voice, Vision, Gesture)
We live in a post-mouse world. How to prototype voice UIs with Voiceflow, Vision Pro apps with Spline, and multimodal AI interactions, all without writing code.

1) Context & Hook
The keyboard and mouse are no longer the default inputs for everything. We have Siri, Google Lens, Apple Vision Pro (gaze and pinch), and the Humane AI Pin. Designing for these requires “multimodal” thinking: input = voice; output = audio + visual. You cannot prototype this in a static Figma frame. You need tools that can “hear” and “speak.” AI powers these tools, allowing designers to simulate complex conversational flows without needing a backend engineer.
2) The Technology Through a Designer’s Lens
Multimodal AI combines:
- ASR (Speech-to-Text): The computer hearing the user.
- LLM (Reasoning): Understanding the intent.
- TTS (Text-to-Speech): Talking back.
- Computer Vision: Seeing the user’s hand or environment.
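Under the hood, the first three components form a single loop: hear, reason, speak. A minimal sketch of that loop, with each component stubbed out (the function names are illustrative; in a real prototype each would call an ASR, LLM, or TTS service of your choice):

```python
# Minimal sketch of one multimodal turn: ASR -> LLM -> TTS.
# All three components are stubs standing in for real services.

def asr(audio: bytes) -> str:
    """Speech-to-text stub: pretend we transcribed the audio."""
    return audio.decode("utf-8")  # stand-in for a real transcription call

def llm(transcript: str) -> str:
    """Reasoning stub: map the user's intent to a response."""
    if "lights" in transcript.lower():
        return "Turning the lights on."
    return "Sorry, I didn't catch that."

def tts(text: str) -> bytes:
    """Text-to-speech stub: 'synthesize' audio from the reply."""
    return text.encode("utf-8")  # stand-in for generated audio

def multimodal_turn(audio_in: bytes) -> bytes:
    """One conversational turn through the whole pipeline."""
    return tts(llm(asr(audio_in)))

print(multimodal_turn(b"turn on the lights"))  # -> b'Turning the lights on.'
```

Tools like Voiceflow hide this loop behind a visual canvas, but keeping the three stages distinct is what makes each one swappable.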
Representative Tools:
- Voiceflow: The industry standard for conversational AI design. Drag-and-drop logic for chatbots and voice assistants.
- Bezi / Spline: 3D web design tools with spatial computing (AR/VR) capabilities.
- ProtoPie: High-fidelity prototyping tool that supports voice input and camera sensors.
- ElevenLabs: Generative voice API (for realistic prototype voices).
3) Core Design Workflows Transformed
A. Chatbot / Voice Assistant Design
- Old Workflow: Excel spreadsheets with “If/Then” logic. Hard to visualize.
- AI Workflow: Voiceflow. Build the flow chart. Connect an LLM to “generate” responses so you don’t have to hard-code every error path.
- Impact: Testable prototype in hours.
B. Spatial UI (AR/VR)
- Old Workflow: Unity/Unreal (Developer tools). Steep learning curve.
- AI Workflow: Bezi/Spline. “Generate a 3D chair.” Place it in AR. Add interaction: “On Gaze, scale up.”
- Impact: Designers can explore AR concepts without C# code.
C. “Wizard of Oz” Testing
- Old Workflow: A human researcher sits behind a curtain and “pretends” to be the AI.
- AI Workflow: Connect the prototype to a real LLM API. The user talks to the prototype, and the prototype actually answers (using a model such as GPT-4).
- Impact: Realistic user testing data.
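Wiring a prototype to a live model is mostly a matter of relaying the conversation history plus a system prompt. A minimal sketch, with the actual API call abstracted behind a `reply_fn` callable so the logic can be dry-run without a key (the commented OpenAI-style call is one possible backend, not the only one):

```python
def prototype_answer(user_utterance, history, reply_fn):
    """Send the running conversation to an LLM and return its answer.

    reply_fn(messages) -> str abstracts the actual API call, so the
    same prototype logic works with any LLM backend (or a canned stub).
    """
    messages = [{"role": "system",
                 "content": "You are the prototype's voice assistant. "
                            "Answer in one short sentence."}]
    messages += history
    messages.append({"role": "user", "content": user_utterance})
    answer = reply_fn(messages)
    # Keep the transcript so follow-up questions have context.
    history.append({"role": "user", "content": user_utterance})
    history.append({"role": "assistant", "content": answer})
    return answer

# Live usage would pass a real call, e.g. with the OpenAI Python SDK:
#   from openai import OpenAI
#   client = OpenAI()
#   real_reply = lambda msgs: client.chat.completions.create(
#       model="gpt-4o", messages=msgs).choices[0].message.content

# Dry run with a canned reply:
canned = lambda msgs: f"(echo) {msgs[-1]['content']}"
print(prototype_answer("What's the weather?", [], canned))
# -> (echo) What's the weather?
```

The history list is what turns a one-shot demo into a testable conversation: the LLM sees every prior turn, just as a production assistant would.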
4) Tool & Approach Comparison
| Tool | Primary Use | Strengths | Limitations | Pricing | Best For |
|---|---|---|---|---|---|
| Voiceflow | Conversation Design | Visualizing complex logic; API integrations. | Primarily for linear/branching logic (though LLM features have been added). | Free/$$ | Chatbot Designers |
| ProtoPie | Sensor High-Fi | Access to phone sensors (tilt, mic, camera). | Steep learning curve; logic can get messy. | $$ | Mobile Interaction |
| Bezi | Spatial (AR) | Web-based AR prototyping; collaborative. | Still evolving feature set. | Free/$$ | AR Designers |
| Figma | The “Hub” | Everyone has it. | Terrible for voice/logic; very “static.” | - | UI Visuals |
5) Case Study: Automotive Voice Assistant
Context: A car manufacturer wanted to redesign their in-car voice assistant to be less rigid (“I didn’t understand that”). The AI Workflow:
- Prototype: Built in Voiceflow. Connected to an LLM (Claude).
- Context: Injected system prompts: “You are a helpful driving assistant. Keep answers short (under 5 seconds) because the user is driving.”
- Testing: Drivers tested the prototype in a simulator.
- Insight: Users interrupted the AI constantly.
- Refinement: Enabled “Barge-in” (interruption) logic in the design.
Metrics:
- Task Success: Improved from 60% to 92% due to LLM flexibility.
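The barge-in refinement from this case study boils down to one rule: the moment the driver starts speaking, playback stops. A minimal sketch of that logic, assuming word-by-word playback; the class and method names are illustrative, not from any specific tool:

```python
class BargeInPlayer:
    """Sketch of barge-in logic: TTS playback is cut the moment
    the microphone detects the driver speaking again."""

    def __init__(self):
        self.speaking = False
        self.played = []  # words actually spoken before any interruption

    def play(self, response, interrupt_after=None):
        """Play a response word by word. interrupt_after simulates the
        driver barging in after N words (None = no interruption)."""
        self.speaking = True
        for i, word in enumerate(response.split()):
            if interrupt_after is not None and i >= interrupt_after:
                self.stop()  # barge-in: cut playback immediately
                break
            self.played.append(word)
        self.speaking = False

    def stop(self):
        self.speaking = False

player = BargeInPlayer()
player.play("Your next service is due in two weeks", interrupt_after=3)
print(player.played)  # -> ['Your', 'next', 'service']
```

In a real in-car prototype, `interrupt_after` would be replaced by a live voice-activity signal from the microphone, but the design decision is the same: never force the user to wait out a long answer.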
6) Implementation Guide for Design Teams
| Phase | Duration | Focus | Key Activities |
|---|---|---|---|
| 1 | Weeks 1-2 | Voice | Learn Voiceflow. It’s the Figma of Voice. Build a simple “Pizza Ordering” bot. |
| 2 | Month 1 | 3D | Learn Spline. You need to know how to move in Z-space. |
| 3 | Month 3 | Connect | Try ProtoPie to combine visual UI + voice input. |
7) Risks, Ethics & Quality Control
- Privacy: “Always listening” prototypes in user testing require strict consent. Mitigation: Explicit “Start/Stop” listening buttons.
- Accessibility: Voice-only is exclusive (Deaf users). Visual-only is exclusive (Blind users). Mitigation: Multimodal Redundancy. Always show what you say; always say what you show.
- Safety: In Automotive/Industrial, distraction kills. Mitigation: Test “Cognitive Load.” (e.g. ISO 15007 standard).
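The privacy mitigation above, explicit start/stop controls, can be enforced in the prototype itself rather than left to discipline: audio frames that arrive outside a consented session are simply dropped. A small sketch (class name hypothetical):

```python
class ConsentGatedMic:
    """Push-to-talk sketch: audio is only captured while the user
    has explicitly pressed 'start listening'."""

    def __init__(self):
        self.listening = False
        self.captured = []

    def start(self):   # user presses the explicit "start listening" button
        self.listening = True

    def stop(self):    # user presses "stop listening"
        self.listening = False

    def on_audio_frame(self, frame):
        # Frames arriving outside a consented session are discarded,
        # never buffered: this avoids the "always listening" failure mode.
        if self.listening:
            self.captured.append(frame)

mic = ConsentGatedMic()
mic.on_audio_frame("ambient chatter")  # dropped: no consent yet
mic.start()
mic.on_audio_frame("order a pizza")    # captured
mic.stop()
mic.on_audio_frame("private remark")   # dropped again
print(mic.captured)  # -> ['order a pizza']
```

Dropping (rather than buffering) non-consented audio is the point: a participant who sees the button released should be able to trust that nothing was recorded.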
8) Future Outlook (2026-2028)
- Ambient Computing: The computer disappears. You just speak to the room.
- Gaze-Prediction: The UI highlights what you are about to look at (using eye-tracking AI).
- Action Step: Stop designing rectangles. Design Flows.