Scenario

Voice and speech-driven input in contenteditable lacks a stable editing contract

OS voice typing, the Web Speech API, and assistive voice tools insert into or mutate contenteditable content without the same event sequence as a keyboard or IME. Caret position, selection, and composition events are inconsistent across platforms, which breaks editors that assume beforeinput/input/composition alignment.

accessibility

Scenario ID

scenario-voice-input-contenteditable

Details

Voice-driven editing (browser Web Speech API, OS-level dictation, Dragon-style tools) interacts with contenteditable through paths that differ from physical keyboard and IME. Applications often see bulk insertText-style updates, missing composition* events, or asynchronous inserts that race with focus and selection changes. Web APIs expose no dedicated “voice mode” flag, so frameworks cannot reliably branch on the input source.

Problem Overview

Editor code frequently assumes:

1. beforeinput → DOM change → input with a coherent getTargetRanges() / selection snapshot.
2. IME paths emit compositionstart / compositionupdate / compositionend.

Voice input regularly violates both assumptions: transcripts may arrive in one chunk or split by word, the selection may be stale if the user moved focus while recognition was in flight, and some platforms never emit composition events for dictation. The result is duplicate handling, wrong insertion offsets, and divergent undo stacks. These are similar failure modes to mobile dictation, but they occur on desktop and in custom Speech API integrations.

Observed Behavior

Web Speech API: Final results are delivered asynchronously; if the developer inserts text in a recognition.onresult handler, the active range may no longer be inside the intended contenteditable (user clicked elsewhere).
OS dictation (desktop): May insert directly into the focused editable; event patterns differ from the same engine’s behavior on mobile WebKit.
No standard detection: Web pages cannot read native “isDictation” state; heuristics (missing composition, large data in one beforeinput) are fragile and collide with some IMEs.

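// Assumption: the editor is the first contenteditable element on the page.
const editable = document.querySelector('[contenteditable]');

editable.addEventListener('beforeinput', (e) => {
  // Voice / speech paths may still use insertText or insertCompositionText,
  // but composition events may be absent (platform-dependent).
  console.log(e.inputType, e.data, e.getTargetRanges?.()?.length);
});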
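
The heuristic mentioned above (no composition session plus a large data payload in one beforeinput) could be sketched roughly as follows. This is only an illustration: looksLikeDictation, attachDictationProbe, and the 12-character threshold are hypothetical names and values, not part of any web API, and the check will misfire for some IMEs and autocomplete features.

```javascript
// Heuristic sketch: flags a beforeinput event that *might* come from dictation.
// minChars = 12 is an arbitrary assumption; tune or drop it for your editor.
function looksLikeDictation(event, sawComposition, minChars = 12) {
  const text = event.data ?? '';
  return (
    event.inputType === 'insertText' &&
    !event.isComposing &&
    !sawComposition &&
    text.length >= minChars
  );
}

// Hypothetical wiring: tracks whether a composition session is open on
// `editable` and reports suspicious inserts to the `onSuspect` callback.
function attachDictationProbe(editable, onSuspect) {
  let sawComposition = false;
  editable.addEventListener('compositionstart', () => { sawComposition = true; });
  editable.addEventListener('compositionend', () => { sawComposition = false; });
  editable.addEventListener('beforeinput', (e) => {
    if (looksLikeDictation(e, sawComposition)) onSuspect(e);
  });
}
```

Treat a positive result as a hint to re-validate selection and model state, not as proof that the input came from voice.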

Impact

State sync: React/Vue/Svelte models can double-apply or miss chunks when duplicate or batched events occur.
Caret and selection: Inserts land in the wrong paragraph or outside the intended wrapper.
Accessibility: Users who rely on voice control need predictable focus; blur races break that predictability.
Undo/redo: Browser and custom history disagree when insertion is split or replayed.
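
To make the State sync item concrete: one way to avoid double-applying chunks is to derive the model update from the DOM text itself, so that handling the same event twice changes nothing. The sketch below assumes a plain-text model object with a text property; diffText and reconcile are hypothetical helper names, and real editors with rich-text models would need a structured diff instead.

```javascript
// Computes the single contiguous edit that turns `modelText` into `domText`.
// Returns null when they already match, so re-running it is a no-op.
// Note: compares UTF-16 code units, so it can split a surrogate pair.
function diffText(modelText, domText) {
  if (modelText === domText) return null;
  const max = Math.min(modelText.length, domText.length);
  let prefix = 0;
  while (prefix < max && modelText[prefix] === domText[prefix]) prefix++;
  let suffix = 0;
  while (
    suffix < max - prefix &&
    modelText[modelText.length - 1 - suffix] === domText[domText.length - 1 - suffix]
  ) suffix++;
  return {
    start: prefix,
    removed: modelText.slice(prefix, modelText.length - suffix),
    inserted: domText.slice(prefix, domText.length - suffix),
  };
}

// Applies the diff to a hypothetical plain-text model ({ text: string }).
function reconcile(model, domText) {
  const edit = diffText(model.text, domText);
  if (edit) {
    model.text =
      model.text.slice(0, edit.start) +
      edit.inserted +
      model.text.slice(edit.start + edit.removed.length);
  }
  return edit;
}
```

Calling reconcile(model, root.textContent) from an input handler would then be safe to repeat: a duplicate event yields null because the model already matches the DOM.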

Browser Comparison

iOS Safari: Dictation quirks are covered under scenario-ios-dictation-duplicate-events; composition often absent on iOS.
Chrome (desktop): Web Speech API is available; integration quality depends entirely on app code and timing.
Firefox: Speech recognition support and OS hooks differ; test before shipping voice features.
Safari (macOS): OS dictation may still emit composition events, unlike many iOS paths.

Solutions

Serialize inserts: Queue transcript segments and apply them inside a single requestAnimationFrame or microtask batch while verifying document.activeElement and a saved Range still belong to your editor root.
Snapshot range on mic start: When starting recognition, clone getSelection().getRangeAt(0) if it is inside the editor; on result, re-validate against editor.contains(range.startContainer) before mutating.
Idempotent reconciliation: After voice-driven updates, diff DOM vs model instead of trusting event counts.
Do not depend on composition for voice: Treat “large synchronous insert without composition” as a distinct path only if you accept false positives with some IMEs.
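
The first two items above (serialize inserts, snapshot the range on mic start) can be combined in a small helper. The following is a minimal sketch under stated assumptions, not a definitive implementation: createVoiceInserter and isRangeInside are hypothetical names, the batch is a microtask scheduled with queueMicrotask, and text is dropped if validation fails instead of being inserted somewhere else. Because the check includes document.activeElement, a microphone button that takes focus would cause inserts to be rejected; adjust that check for your UI.

```javascript
// True when the saved range is still attached and fully inside the editor root.
function isRangeInside(root, range) {
  return Boolean(
    range &&
    range.startContainer.isConnected &&
    root.contains(range.startContainer) &&
    root.contains(range.endContainer)
  );
}

// `getSelection` is injectable so the helper can be exercised without a browser.
function createVoiceInserter(root, getSelection = () => window.getSelection()) {
  let saved = null;      // range captured when recognition starts
  const queue = [];      // transcript segments waiting for the next batch
  let scheduled = false;

  // Call when the microphone starts; returns false if the caret is outside root.
  function snapshot() {
    const sel = getSelection();
    const range = sel && sel.rangeCount > 0 ? sel.getRangeAt(0) : null;
    saved = isRangeInside(root, range) ? range.cloneRange() : null;
    return saved !== null;
  }

  // Applies all queued segments as one insert, or drops them if validation fails.
  function flush() {
    scheduled = false;
    const text = queue.splice(0).join('');
    const doc = root.ownerDocument;
    if (!text || !isRangeInside(root, saved) || !root.contains(doc.activeElement)) {
      return false;
    }
    const node = doc.createTextNode(text);
    saved.deleteContents();
    saved.insertNode(node);
    saved.setStartAfter(node); // keep the saved caret after the inserted text
    saved.collapse(true);
    return true;
  }

  // Call from the recognition result handler for each final segment.
  function enqueue(segment) {
    queue.push(segment);
    if (!scheduled) {
      scheduled = true;
      queueMicrotask(flush);
    }
  }

  return { snapshot, enqueue, flush };
}
```

In a Web Speech API integration, snapshot() would typically run just before recognition.start(), and enqueue() inside the onresult handler for results marked final.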

Best Practices

Treat voice and dictation as high-latency, selection-sensitive inputs; never assume the selection at recognition start equals the selection at result delivery.
Prefer one editorial root and explicit data-editor-root checks before mutating.
Cross-test iOS dictation, desktop OS dictation, and Web Speech API separately; they are not interchangeable.
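
The data-editor-root check from the list above might look like the following sketch. It assumes the editor's root element carries a data-editor-root attribute; findEditorRoot and canMutate are hypothetical helper names.

```javascript
// Returns the nearest element (the node itself or an ancestor) that carries
// the data-editor-root attribute, or null if there is none.
function findEditorRoot(node) {
  if (!node) return null;
  const el = node.nodeType === 1 ? node : node.parentElement; // 1 === ELEMENT_NODE
  return el ? el.closest('[data-editor-root]') : null;
}

// True only when both ends of the range resolve to the expected editor root.
function canMutate(range, expectedRoot) {
  return (
    findEditorRoot(range.startContainer) === expectedRoot &&
    findEditorRoot(range.endContainer) === expectedRoot
  );
}
```

Running canMutate immediately before each voice-driven insert keeps the check close to the mutation, which matters because the selection may have moved since recognition started.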

ce-0585-chrome-web-speech-api-contenteditable-insertion – Async Web Speech API result delivery vs selection focus (Chrome desktop, draft)

scenario-ios-dictation-duplicate-events – iOS-specific duplicate beforeinput / input after dictation

References

MDN: Web Speech API – Recognition API overview
W3C Speech API – Draft community specification
MDN: InputEvent – beforeinput / input semantics

Variants

Each row is a concrete case for this scenario, with a dedicated document and playground.

Case	OS	Device	Browser	Keyboard	Status
ce-0585-chrome-web-speech-api-contenteditable-insertion	Windows 10+	Desktop Any	Chrome 120+	US QWERTY (microphone for speech)	draft

Cases

Web Speech API final results can insert at the wrong range if selection moves during recognition

OS: Windows 10+ · Device: Desktop Any · Browser: Chrome 120+ · Keyboard: US QWERTY (microphone for speech)

Related Scenarios

Other scenarios that share similar tags or category.

Tags: accessibility

Accessibility Foundations: Screen Readers, ARIA, and the AX-Tree

Ensuring contenteditable editors are navigable for assistive technology users through proper ARIA mapping and engine synchronization.

5 cases

Tags: accessibility

Keyboard navigation accessibility issues in contenteditable

Keyboard navigation in contenteditable elements must comply with WCAG 2.1.1 (Keyboard) and 2.1.2 (No Keyboard Trap) requirements. The Tab key typically moves focus out of contenteditable, while arrow keys move the caret. Custom keyboard handling must ensure all functionality is keyboard-operable and focus remains visible.

0 cases

Category: accessibility

role="textbox" and screen reader behavior with contenteditable

Assistive technologies map contenteditable to textbox-like semantics when authors use role=textbox or aria-multiline. Conflicts between those attributes, the actual HTML semantics, and browser AX trees can confuse announcements.

1 case

Category: accessibility

Browser extensions interfering with contenteditable (Grammarly and others)

Extensions inject overlays, mutate DOM, or listen to input for grammar and translation—breaking framework-controlled editors, selection restoration, and IME composition boundaries.

1 case

Category: accessibility

Native spellcheck interfering with editing

Browser spellcheck underlines and may mutate DOM or fire input-adjacent updates—conflicting with custom dictionaries, code blocks, and IME composition.

1 case
