
Multimodal Prototyping: Beyond the Screen (Voice, Vision, Gesture)

We live in a post-mouse world. How to prototype Voice UIs using Voiceflow, Vision Pro apps using Spline, and multimodal AI interactions without writing code.

4 min read

1) Context & Hook

The keyboard and mouse are no longer the default inputs for everything. We have Siri, Google Lens, Apple Vision Pro (gaze/pinch), and the Humane Ai Pin. Designing for these requires “Multimodal” thinking: Input = Voice; Output = Audio + Visual. You cannot prototype this in a static Figma frame. You need tools that can “hear” and “speak.” AI powers these tools, allowing designers to simulate complex conversational flows without needing a backend engineer.

2) The Technology Through a Designer’s Lens

Multimodal AI combines:

  1. ASR (Speech-to-Text): The computer hearing the user.
  2. LLM (Reasoning): Understanding the intent.
  3. TTS (Text-to-Speech): Talking back.
  4. Computer Vision: Seeing the user’s hand or environment.
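The first three stages above form a single conversational turn. Here is a minimal runnable sketch of that hear → reason → speak loop; every function body is a hypothetical stub standing in for a real service (e.g. Whisper for ASR, GPT-4 for the LLM, ElevenLabs for TTS):

```python
def asr(audio: bytes) -> str:
    """Speech-to-text: the computer hearing the user (stubbed)."""
    return audio.decode("utf-8")  # pretend the audio is already a transcript

def llm(transcript: str) -> str:
    """Reasoning: understanding intent (stubbed with a toy rule)."""
    if "lights" in transcript:
        return "Turning the lights on."
    return "Sorry, I did not catch that."

def tts(reply: str) -> bytes:
    """Text-to-speech: talking back (stubbed)."""
    return reply.encode("utf-8")

def multimodal_turn(audio: bytes) -> bytes:
    """One full turn: hear -> reason -> speak."""
    return tts(llm(asr(audio)))

print(multimodal_turn(b"turn on the lights"))  # b'Turning the lights on.'
```

In a real prototype each stub is replaced by an API call, but the shape of the loop stays the same — which is exactly what the no-code tools below are wiring up for you.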

Representative Tools:

  • Voiceflow: The industry standard for conversational AI design. Drag-and-drop logic for chatbots and voice assistants.
  • Bezi / Spline: 3D web design tools with spatial computing (AR/VR) capabilities.
  • ProtoPie: High-fidelity prototyping that supports voice input and camera sensors.
  • ElevenLabs: Generative voice API (for realistic prototype voices).

3) Core Design Workflows Transformed

A. Chatbot / Voice Assistant Design

  • Old Workflow: Excel spreadsheets with “If/Then” logic. Hard to visualize.
  • AI Workflow: Voiceflow. Build the flow chart. Connect an LLM to “generate” responses so you don’t have to hard-code every error path.
  • Impact: Testable prototype in hours.
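The “let the LLM cover the error paths” pattern can be sketched in a few lines: hard-code the happy paths as in a classic flow chart, and route anything unmatched to a generative fallback. `generate_reply` here is a hypothetical stand-in for an LLM call (e.g. Voiceflow’s LLM step), stubbed so the sketch runs offline:

```python
# Hard-coded happy paths, as in a traditional branching flow chart.
HAPPY_PATHS = {
    "order pizza": "Great, what size would you like?",
    "track order": "Your pizza is 10 minutes away.",
}

def generate_reply(utterance: str) -> str:
    # Hypothetical LLM call; a canned response keeps this sketch runnable.
    return f"I can help with pizza orders. You said: {utterance!r}"

def route(utterance: str) -> str:
    """Match a known branch, otherwise fall back to generation."""
    key = utterance.lower().strip()
    return HAPPY_PATHS.get(key, generate_reply(key))

print(route("Order pizza"))    # hard-coded branch
print(route("my dog ate it"))  # generative fallback
```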

B. Spatial UI (AR/VR)

  • Old Workflow: Unity/Unreal (Developer tools). Steep learning curve.
  • AI Workflow: Bezi/Spline. “Generate a 3D chair.” Place it in AR. Add interaction: “On Gaze, scale up.”
  • Impact: Designers can explore AR concepts without C# code.
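The “On Gaze, scale up” interaction above boils down to a tiny per-frame state update. This sketch eases an object’s scale toward a target while it is gazed at; the easing factor and target values are hypothetical, not actual Bezi/Spline API:

```python
def update_scale(scale: float, gazed: bool,
                 target: float = 1.3, rest: float = 1.0,
                 ease: float = 0.5) -> float:
    """Each frame, move the scale halfway toward its goal state."""
    goal = target if gazed else rest
    return scale + (goal - scale) * ease

s = 1.0
for _ in range(3):          # three frames of sustained gaze
    s = update_scale(s, gazed=True)
print(round(s, 4))          # eases toward the 1.3 target
```

The no-code tools hide this math behind an “On Gaze” trigger, but understanding it helps when tuning how “snappy” a spatial interaction feels.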

C. “Wizard of Oz” Testing

  • Old Workflow: Human researcher sits behind a curtain and “pretends” to be the AI.
  • AI Workflow: Connect the prototype to a real LLM API. The user talks to the prototype, the prototype actually answers (using GPT-4).
  • Impact: Realistic user testing data.
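“Wizard of Oz without the wizard” can be sketched as a prototype that forwards the user’s utterance straight to an LLM. The network call is injected as a parameter so the real API (e.g. OpenAI’s chat endpoint) can be swapped for an offline stub; the stub and prompt below are hypothetical:

```python
SYSTEM_PROMPT = "You are the prototype's assistant. Answer in one sentence."

def prototype_turn(user_text, call_llm):
    """Forward the user's words to an LLM instead of a hidden researcher."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]
    return call_llm(messages)

def fake_llm(messages):
    # Offline stub standing in for the real chat-completion API call.
    return f"(model reply to: {messages[-1]['content']})"

print(prototype_turn("Where is my order?", fake_llm))
```

Swapping `fake_llm` for a real API client is the only change needed to go from rehearsal to live user testing.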

4) Tool & Approach Comparison

| Tool | Primary Use | Strengths | Limitations | Pricing | Best For |
| --- | --- | --- | --- | --- | --- |
| Voiceflow | Conversation Design | Visualizing complex logic; API integrations. | Primarily linear/branching logic (though LLM features have been added). | Free/$$ | Chatbot Designers |
| ProtoPie | Sensor High-Fi | Access to phone sensors (tilt, mic, camera). | Steep learning curve; logic can get messy. | $$ | Mobile Interaction |
| Bezi | Spatial (AR) | Web-based AR prototyping; collaborative. | Still-evolving feature set. | Free/$$ | AR Designers |
| Figma | The “Hub” | Everyone has it. | Terrible for voice/logic; very “static.” | - | UI Visuals |

5) Case Study: Automotive Voice Assistant

Context: A car manufacturer wanted to redesign their in-car voice assistant to be less rigid (“I didn’t understand that”). The AI Workflow:

  1. Prototype: Built in Voiceflow. Connected to an LLM (Claude).
  2. Context: Injected system prompts: “You are a helpful driving assistant. Keep answers short (under 5 seconds) because the user is driving.”
  3. Testing: Drivers tested the prototype in a simulator.
  4. Insight: Users interrupted the AI constantly.
  5. Refinement: Enabled “Barge-in” (interruption) logic in the design.
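The barge-in refinement from step 5 can be sketched as a playback loop that keeps listening while the assistant speaks: if the driver starts talking, playback is cut immediately. The chunked playback and mic-event flags are hypothetical, not the team’s actual implementation:

```python
def assistant_speak(reply_chunks, mic_events):
    """Play the reply chunk by chunk; stop as soon as the user barges in."""
    spoken = []
    for chunk, user_spoke in zip(reply_chunks, mic_events):
        if user_spoke:          # driver interrupted: stop immediately
            return spoken, True
        spoken.append(chunk)    # otherwise keep playing
    return spoken, False        # finished without interruption

# Driver interrupts on the third chunk:
chunks = ["Your next", "turn is", "in two", "miles."]
events = [False, False, True, False]
print(assistant_speak(chunks, events))  # (['Your next', 'turn is'], True)
```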

Metrics:

  • Task Success: Improved from 60% to 92% due to LLM flexibility.

6) Implementation Guide for Design Teams

| Phase | Duration | Focus | Key Activities |
| --- | --- | --- | --- |
| 1 | Weeks 1-2 | Voice | Learn Voiceflow — it’s the Figma of Voice. Build a simple “Pizza Ordering” bot. |
| 2 | Month 1 | 3D | Learn Spline. You need to know how to move in Z-space. |
| 3 | Month 3 | Connect | Try ProtoPie to combine visual UI + voice input. |

7) Risks, Ethics & Quality Control

  1. Privacy: “Always listening” prototypes in user testing require strict consent. Mitigation: Explicit “Start/Stop” listening buttons.
  2. Accessibility: Voice-only is exclusive (Deaf users). Visual-only is exclusive (Blind users). Mitigation: Multimodal Redundancy. Always show what you say; always say what you show.
  3. Safety: In Automotive/Industrial, distraction kills. Mitigation: Test “Cognitive Load.” (e.g. ISO 15007 standard).

8) Future Outlook (2026-2028)

  • Ambient Computing: The computer disappears. You just speak to the room.
  • Gaze-Prediction: The UI highlights what you are about to look at (using eye-tracking AI).
  • Action Step: Stop designing rectangles. Design Flows.


Tags: multimodal, voice UI, spatial design, Voiceflow, prototyping, AR/VR