
Multimodal Prototyping: Beyond the Screen (Voice, Vision, Gesture)

We live in a post-mouse world. How to prototype Voice UIs using Voiceflow, Vision Pro apps using Spline, and multimodal AI interactions without writing code.

4 min read

1) Context & Hook

The keyboard and mouse are no longer the default inputs for everything. We have Siri, Google Lens, Apple Vision Pro (gaze/pinch), and the Humane Ai Pin. Designing for these requires “Multimodal” thinking: Input = Voice; Output = Audio + Visual. You cannot prototype this in a static Figma frame. You need tools that can “hear” and “speak.” AI powers these tools, allowing designers to simulate complex conversational flows without needing a backend engineer.

2) The Technology Through a Designer’s Lens

Multimodal AI combines:

  1. ASR (Speech-to-Text): The computer hearing the user.
  2. LLM (Reasoning): Understanding the intent.
  3. TTS (Text-to-Speech): Talking back.
  4. Computer Vision: Seeing the user’s hand or environment.
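The first three stages above form a single conversational turn. Here is a minimal runnable sketch of that hear → reason → speak loop; every function body is a hypothetical stub standing in for a real service (e.g. Whisper for ASR, GPT-4 for the LLM, ElevenLabs for TTS):

```python
def asr(audio: bytes) -> str:
    """Speech-to-text: the computer hearing the user (stubbed)."""
    return audio.decode("utf-8")  # pretend the audio is already a transcript

def llm(transcript: str) -> str:
    """Reasoning: understanding intent (stubbed with a toy rule)."""
    if "lights" in transcript:
        return "Turning the lights on."
    return "Sorry, I did not catch that."

def tts(reply: str) -> bytes:
    """Text-to-speech: talking back (stubbed)."""
    return reply.encode("utf-8")

def multimodal_turn(audio: bytes) -> bytes:
    """One full turn: hear -> reason -> speak."""
    return tts(llm(asr(audio)))

print(multimodal_turn(b"turn on the lights"))  # b'Turning the lights on.'
```

In a real prototype each stub is replaced by an API call, but the shape of the loop stays the same — which is exactly what the no-code tools below are wiring up for you.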

Representative Tools:

  • Voiceflow: The industry standard for conversational AI design. Drag-and-drop logic for chatbots and voice assistants.
  • Bezi / Spline: 3D web design tools with spatial computing (AR/VR) capabilities.
  • ProtoPie: High-fidelity prototyping that supports voice input and camera sensors.
  • ElevenLabs: Generative voice API (for realistic prototype voices).

3) Core Design Workflows Transformed

A. Chatbot / Voice Assistant Design

  • Old Workflow: Excel spreadsheets with “If/Then” logic. Hard to visualize.
  • AI Workflow: Voiceflow. Build the flow chart. Connect an LLM to “generate” responses so you don’t have to hard-code every error path.
  • Impact: Testable prototype in hours.
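The “let the LLM cover the error paths” pattern can be sketched in a few lines: hard-code the happy paths as in a classic flow chart, and route anything unmatched to a generative fallback. `generate_reply` here is a hypothetical stand-in for an LLM call (e.g. Voiceflow’s LLM step), stubbed so the sketch runs offline:

```python
# Hard-coded happy paths, as in a traditional branching flow chart.
HAPPY_PATHS = {
    "order pizza": "Great, what size would you like?",
    "track order": "Your pizza is 10 minutes away.",
}

def generate_reply(utterance: str) -> str:
    # Hypothetical LLM call; a canned response keeps this sketch runnable.
    return f"I can help with pizza orders. You said: {utterance!r}"

def route(utterance: str) -> str:
    """Match a known branch, otherwise fall back to generation."""
    key = utterance.lower().strip()
    return HAPPY_PATHS.get(key, generate_reply(key))

print(route("Order pizza"))    # hard-coded branch
print(route("my dog ate it"))  # generative fallback
```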

B. Spatial UI (AR/VR)

  • Old Workflow: Unity/Unreal (Developer tools). Steep learning curve.
  • AI Workflow: Bezi/Spline. “Generate a 3D chair.” Place it in AR. Add interaction: “On Gaze, scale up.”
  • Impact: Designers can explore AR concepts without C# code.
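The “On Gaze, scale up” interaction above boils down to a tiny per-frame state update. This sketch eases an object’s scale toward a target while it is gazed at; the easing factor and target values are hypothetical, not actual Bezi/Spline API:

```python
def update_scale(scale: float, gazed: bool,
                 target: float = 1.3, rest: float = 1.0,
                 ease: float = 0.5) -> float:
    """Each frame, move the scale halfway toward its goal state."""
    goal = target if gazed else rest
    return scale + (goal - scale) * ease

s = 1.0
for _ in range(3):          # three frames of sustained gaze
    s = update_scale(s, gazed=True)
print(round(s, 4))          # eases toward the 1.3 target
```

The no-code tools hide this math behind an “On Gaze” trigger, but understanding it helps when tuning how “snappy” a spatial interaction feels.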

C. “Wizard of Oz” Testing

  • Old Workflow: Human researcher sits behind a curtain and “pretends” to be the AI.
  • AI Workflow: Connect the prototype to a real LLM API. The user talks to the prototype, the prototype actually answers (using GPT-4).
  • Impact: Realistic user testing data.
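“Wizard of Oz without the wizard” can be sketched as a prototype that forwards the user’s utterance straight to an LLM. The network call is injected as a parameter so the real API (e.g. OpenAI’s chat endpoint) can be swapped for an offline stub; the stub and prompt below are hypothetical:

```python
SYSTEM_PROMPT = "You are the prototype's assistant. Answer in one sentence."

def prototype_turn(user_text, call_llm):
    """Forward the user's words to an LLM instead of a hidden researcher."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]
    return call_llm(messages)

def fake_llm(messages):
    # Offline stub standing in for the real chat-completion API call.
    return f"(model reply to: {messages[-1]['content']})"

print(prototype_turn("Where is my order?", fake_llm))
```

Swapping `fake_llm` for a real API client is the only change needed to go from rehearsal to live user testing.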

4) Tool & Approach Comparison

| Tool | Primary Use | Strengths | Limitations | Pricing | Best For |
| --- | --- | --- | --- | --- | --- |
| Voiceflow | Conversation Design | Visualizing complex logic; API integrations. | Primarily linear/branching logic (though LLM features have been added). | Free/$$ | Chatbot Designers |
| ProtoPie | Sensor High-Fi | Access to phone sensors (tilt, mic, camera). | Steep learning curve; logic can get messy. | $$ | Mobile Interaction |
| Bezi | Spatial (AR) | Web-based AR prototyping; collaborative. | Still-evolving feature set. | Free/$$ | AR Designers |
| Figma | The “Hub” | Everyone has it. | Terrible for voice/logic; very “static.” | - | UI Visuals |

5) Case Study: Automotive Voice Assistant

Context: A car manufacturer wanted to redesign their in-car voice assistant to be less rigid (“I didn’t understand that”). The AI Workflow:

  1. Prototype: Built in Voiceflow. Connected to an LLM (Claude).
  2. Context: Injected system prompts: “You are a helpful driving assistant. Keep answers short (under 5 seconds) because the user is driving.”
  3. Testing: Drivers tested the prototype in a simulator.
  4. Insight: Users interrupted the AI constantly.
  5. Refinement: Enabled “Barge-in” (interruption) logic in the design.
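The barge-in refinement from step 5 can be sketched as a playback loop that keeps listening while the assistant speaks: if the driver starts talking, playback is cut immediately. The chunked playback and mic-event flags are hypothetical, not the team’s actual implementation:

```python
def assistant_speak(reply_chunks, mic_events):
    """Play the reply chunk by chunk; stop as soon as the user barges in."""
    spoken = []
    for chunk, user_spoke in zip(reply_chunks, mic_events):
        if user_spoke:          # driver interrupted: stop immediately
            return spoken, True
        spoken.append(chunk)    # otherwise keep playing
    return spoken, False        # finished without interruption

# Driver interrupts on the third chunk:
chunks = ["Your next", "turn is", "in two", "miles."]
events = [False, False, True, False]
print(assistant_speak(chunks, events))  # (['Your next', 'turn is'], True)
```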

Metrics:

  • Task Success: Improved from 60% to 92% due to LLM flexibility.

6) Implementation Guide for Design Teams

| Phase | Duration | Focus | Key Activities |
| --- | --- | --- | --- |
| 1 | Weeks 1-2 | Voice | Learn Voiceflow — it’s the Figma of Voice. Build a simple “Pizza Ordering” bot. |
| 2 | Month 1 | 3D | Learn Spline. You need to know how to move in Z-space. |
| 3 | Month 3 | Connect | Try ProtoPie to combine visual UI + voice input. |

7) Risks, Ethics & Quality Control

  1. Privacy: “Always listening” prototypes in user testing require strict consent. Mitigation: Explicit “Start/Stop” listening buttons.
  2. Accessibility: Voice-only is exclusive (Deaf users). Visual-only is exclusive (Blind users). Mitigation: Multimodal Redundancy. Always show what you say; always say what you show.
  3. Safety: In Automotive/Industrial, distraction kills. Mitigation: Test “Cognitive Load.” (e.g. ISO 15007 standard).

8) Future Outlook (2026-2028)

  • Ambient Computing: The computer disappears. You just speak to the room.
  • Gaze-Prediction: The UI highlights what you are about to look at (using eye-tracking AI).
  • Action Step: Stop designing rectangles. Design Flows.


Tags: multimodal, voice UI, spatial design, Voiceflow, prototyping, AR/VR