# Code Guidelines
**1. Asynchronous Programming Principles:**
* **Primary Mechanism:** Use `async`/`await` and Promises for handling asynchronous operations.
* **Non-Blocking:** Ensure the main thread remains responsive. Long-running operations (like Kokoro loading) should be handled in a way that doesn't block UI updates or animations (e.g., using `requestIdleCallback` if appropriate, or careful yielding).
* **Event-Driven Communication:** Use a dedicated event system (such as the `ModuleEvent` class) for communication between the loader and modules (e.g., for progress updates, state changes, messages) instead of injecting callbacks directly from the loader into module methods.
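
As an illustration of event-driven communication, the following minimal sketch shows a module reporting progress over a shared event bus instead of calling a loader-supplied callback. The event name `module:progress`, the `detail` payload shape, and the `moduleEvents` bus are assumptions made for this example; the project's actual `ModuleEvent` class may differ.

```javascript
// Hypothetical sketch: the event name and payload shape are assumptions.
class ModuleEvent extends Event {
  constructor(type, detail) {
    super(type);
    this.detail = detail; // e.g. { moduleId, progress }
  }
}

// A shared bus that both the loader and the modules can reference.
const moduleEvents = new EventTarget();

// Loader side: subscribe once and record whatever modules report.
const progressByModule = new Map();
moduleEvents.addEventListener('module:progress', (event) => {
  progressByModule.set(event.detail.moduleId, event.detail.progress);
});

// Module side: report progress without knowing who is listening.
function reportProgress(moduleId, progress) {
  moduleEvents.dispatchEvent(new ModuleEvent('module:progress', { moduleId, progress }));
}

reportProgress('tts', 0.5);
```

The loader never hands a function to the module; either side can be replaced as long as the event contract stays the same.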
**2. Module System Standards & Dependency Management:**
* **Native ES Modules:** Utilize the browser's native ES Module system (`import`/`export`, `<script type="module">`) without relying on build tools.
* **Lean Loader:** The `loader.js` file should be focused *only* on:
    * Orchestrating the loading of module scripts.
    * Monitoring module initialization progress and state via the event system.
    * Displaying the loading status UI.
    * Hiding the overlay and potentially starting the main application loop *after* all modules are finished.
* **Module Responsibility:** All module-specific logic, configuration, resource loading (like CSS, images, or specific libraries like Kokoro), and detailed progress reporting should reside *within* the respective module file, not in `loader.js`.
* **Dependency Declaration:** Modules must declare their dependencies (e.g., `ui-controller` depends on `tts` and `animation-queue`).
* **Loader Enforces Order:** The loader is responsible for ensuring that a module's `init` phase only begins *after* all its declared dependencies have reached the `FINISHED` state.
* **Rely on Dependency Management:** Modules should *assume* their dependencies will be loaded and ready before their `init` function is called by the loader. There should be **no** conditional checks within a module like `if (dependencyModule)` with fallbacks for when the dependency isn't ready.
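
One way a loader could turn declared dependencies into a safe initialization order is a depth-first topological sort. The sketch below is illustrative only: the `dependencies` table and the `resolveInitOrder` function are hypothetical names, and only the `ui-controller` entry comes from the example above.

```javascript
// Hypothetical dependency declarations; the ui-controller entry mirrors the text.
const dependencies = {
  'tts': [],
  'animation-queue': [],
  'ui-controller': ['tts', 'animation-queue'],
};

// Returns module ids ordered so that every module appears after its dependencies.
function resolveInitOrder(deps) {
  const order = [];
  const state = new Map(); // id -> 'visiting' | 'done'

  function visit(id, path) {
    if (state.get(id) === 'done') return;
    if (state.get(id) === 'visiting') {
      throw new Error(`Circular dependency: ${[...path, id].join(' -> ')}`);
    }
    if (!Object.prototype.hasOwnProperty.call(deps, id)) {
      throw new Error(`Unknown module "${id}"`);
    }
    state.set(id, 'visiting');
    for (const dep of deps[id]) visit(dep, [...path, id]);
    state.set(id, 'done');
    order.push(id);
  }

  for (const id of Object.keys(deps)) visit(id, []);
  return order;
}
```

Failing loudly on cycles and unknown ids matches the rule that an unmet dependency is a loader bug rather than something a module should work around.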
**3. Module Interface & Code Sharing:**
* **Base Class:** Use a `BaseModule` class that all modules extend. This enforces a consistent interface (e.g., `initializeInterface`, `getState`) and provides shared functionality (e.g., `changeState`, `reportProgress`, event dispatching).
* **Module Registry:** Use a central `moduleRegistry` to register modules and facilitate dependency checking and management.
* **Preserve Functionality:** When adapting existing modules (like `ui-controller`) to the new `BaseModule` interface, all original functionality must be preserved and integrated correctly, not replaced with placeholders.
**4. State Management:**
* **Defined States:** Modules must adhere to the defined states: `PENDING`, `LOADING` (script loading), `WAITING` (waiting for dependencies), `INITIALIZING` (running `init` logic), `FINISHED`, `ERROR`.
* **Accurate Reporting:** Modules must accurately report their state transitions via the event system. A module (like `tts`) should not report `FINISHED` until all its critical internal operations (including background loading like Kokoro) are complete. The loader's UI must display these states correctly.
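
The defined states can be enforced with a small transition table. The specification lists the states but not which transitions are legal, so the table below is an assumption, and `StatefulModule` is a hypothetical stand-in for the relevant part of `BaseModule`.

```javascript
const ModuleState = Object.freeze({
  PENDING: 'PENDING',
  LOADING: 'LOADING',
  WAITING: 'WAITING',
  INITIALIZING: 'INITIALIZING',
  FINISHED: 'FINISHED',
  ERROR: 'ERROR',
});

// Assumed transition table: linear progress, with ERROR reachable from any active state.
const allowedTransitions = {
  PENDING: ['LOADING', 'ERROR'],
  LOADING: ['WAITING', 'ERROR'],
  WAITING: ['INITIALIZING', 'ERROR'],
  INITIALIZING: ['FINISHED', 'ERROR'],
  FINISHED: [],
  ERROR: [],
};

class StatefulModule {
  constructor(id) {
    this.id = id;
    this.state = ModuleState.PENDING;
    this.listeners = [];
  }

  onStateChange(listener) {
    this.listeners.push(listener);
  }

  // Rejects illegal jumps (e.g. PENDING -> FINISHED) and notifies listeners otherwise.
  changeState(next) {
    if (!allowedTransitions[this.state].includes(next)) {
      throw new Error(`${this.id}: illegal transition ${this.state} -> ${next}`);
    }
    const previous = this.state;
    this.state = next;
    for (const listener of this.listeners) listener({ id: this.id, previous, next });
  }
}
```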
**5. Handling `setTimeout` and Fallbacks:**
* **`setTimeout` for Flow Control/Synchronization:** **Strictly prohibited.** Using `setTimeout` to wait for asynchronous operations to complete, fix timing issues, or manage dependencies is considered a hack and indicates a flaw in the asynchronous architecture. Proper use of `async`/`await`, Promises, and the loader's dependency management should make this unnecessary.
* **`setTimeout` for Delays:** Acceptable *only* within well-encapsulated components for specific, justifiable reasons (like debouncing, throttling, or potentially *very* short delays *if absolutely unavoidable* after direct DOM manipulation, though this should also be minimized). It must **not** be used to paper over asynchronous race conditions or timing problems. The `AnimationQueue` is an acceptable place for internal scheduling timeouts, but application code calling it should rely on its event-driven nature.
* **Fallbacks for Missing Dependencies:** **Strictly prohibited.** Code within a module should not check if a dependency exists and provide a fallback path. The module loader's responsibility is to guarantee dependencies are met before initializing the module. Errors should be handled for *actual* failures during initialization, not for unmet dependencies (which indicates a loader bug).
Adhering to these principles will lead to a cleaner, more robust, and maintainable asynchronous module loading system.
# Module Loader System Architecture
The module loader system is designed to manage the loading and initialization of modular components in a structured, dependency-aware manner with visual progress reporting.
## Overall Architecture
1. **Module Registry Pattern**: Uses a centralized registry to track and manage all modules and their states.
2. **Event-Driven Communication**: Modules communicate with the loader and each other through custom events.
3. **Progress Visualization**: Provides a visual loading overlay with per-module progress tracking.
4. **State Management**: Tracks each module through defined states (PENDING, LOADING, WAITING, INITIALIZING, FINISHED, ERROR).
5. **Dependency Resolution**: Handles module dependencies to ensure proper initialization order.
## Core Components
1. **ModuleRegistry**: Central repository for all modules
- Tracks registration and availability of modules
- Manages promises for module readiness
- Provides dependency resolution through `waitForModule` and `waitForModules`
2. **BaseModule**: Abstract base class that all modules extend
- Implements standard lifecycle methods
- Handles progress reporting and state changes
- Provides consistent interface for the loader
3. **ModuleLoader**: Main orchestrator of the loading process
- Dynamically loads module scripts
- Creates and manages the visual loading interface
- Initializes modules in the correct order
- Tracks and displays overall loading progress
4. **ModuleEvent**: Custom event system for inter-module communication
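
A registry that "manages promises for module readiness" might look roughly like the sketch below. It is a simplified assumption: here a module counts as ready as soon as it is registered, whereas the real registry may resolve only once the module reaches `FINISHED`. The `deferredFor` helper is a hypothetical name.

```javascript
class ModuleRegistry {
  constructor() {
    this.modules = new Map();
    this.waiters = new Map(); // id -> { promise, resolve }
  }

  // Returns (creating it if needed) the deferred that settles when `id` is registered.
  deferredFor(id) {
    if (!this.waiters.has(id)) {
      let resolve;
      const promise = new Promise((res) => { resolve = res; });
      this.waiters.set(id, { promise, resolve });
    }
    return this.waiters.get(id);
  }

  register(module) {
    this.modules.set(module.id, module);
    this.deferredFor(module.id).resolve(module);
  }

  // Works whether the module registers before or after the call.
  waitForModule(id) {
    return this.deferredFor(id).promise;
  }

  waitForModules(ids) {
    return Promise.all(ids.map((id) => this.waitForModule(id)));
  }
}
```

Because waiting is promise-based, callers can `await registry.waitForModules([...])` without any polling or timers.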
## Loading Sequence
1. HTML page loads and includes the loader script as a module
2. DOMContentLoaded triggers the loader initialization
3. Loader creates the loading UI and registers event listeners
4. Module scripts are loaded dynamically in parallel
5. Each module registers itself with the registry
6. Modules are initialized with dependency checking
7. Progress is reported and visualized throughout
8. When all modules reach FINISHED state, loading overlay is hidden
## Module Lifecycle
1. **PENDING**: Initial state before loading begins
2. **LOADING**: Module script is being loaded
3. **WAITING**: Module is waiting for dependencies to be ready
4. **INITIALIZING**: Module's initialize() method is executing
5. **FINISHED**: Module is fully initialized and ready
6. **ERROR**: Module encountered an error during initialization
## Integration Pattern
Modules follow a consistent registration pattern:
```javascript
// Create the singleton instance
const ModuleName = new ModuleNameClass();
// Register with the module registry
moduleRegistry.register(ModuleName);
// Export the module
export { ModuleName };
// Keep a reference in window for loader system
window.ModuleName = ModuleName;
```
This design creates a flexible, maintainable system for loading complex applications with multiple interdependent components, prioritizing both user experience and performance.
# TTS System Structure & Kokoro Loading
This section summarizes the TTS system structure and how the Kokoro TTS engine is loaded:
## Overall TTS System Architecture
1. **Modular Design**: The TTS system uses a modular architecture with multiple handler classes, each implementing a different TTS approach.
2. **Three TTS Providers**:
- `BrowserTTSHandler` - Uses the built-in Web Speech API
- `KokoroHandler` - Uses Kokoro.js neural TTS for high-quality voices
- `ApiTTSHandler` - Uses external TTS services like ElevenLabs
3. **Factory Pattern**: `TTSFactory` manages the handlers, provides a unified interface, and handles provider switching.
4. **Module System**: `TTSPlayer` module is registered with the `moduleRegistry` as part of the modular loading system.
## Loading Sequence
1. The module loader first loads `tts-player.js`, which in turn loads `tts-factory.js`.
2. The factory initializes providers in order of preference:
- First loads the `BrowserTTSHandler` for immediate low-quality TTS
- Then loads the `ApiTTSHandler` if configured
- Finally attempts to load `KokoroHandler` in the background with low priority
3. The system uses the best available provider, with a preference for Kokoro when available.
## Kokoro TTS Loading Strategy
After consulting the documentation (https://www.npmjs.com/package/kokoro-js), we made these decisions:
1. **Low-Priority Loading**: Kokoro is loaded with `requestIdleCallback` to avoid impacting page performance.
2. **Kokoro npm package integration**: Load Kokoro directly from the local server:
`/js/kokoro-js.js` contains the complete minified code of the Kokoro npm package, copied from the `node_modules` folder to the public directory. Do not try to read or change it; it is too large.
3. **Pipeline Creation**: Per documentation, we use the pipeline pattern:
```javascript
this.kokoro = await window.kokoroTTS.pipeline('text-to-speech', {
  quantized: true,
  progress_callback: this.progressCallback
});
```
4. **Voice List**: We hardcoded the available voices rather than querying them dynamically.
5. **Audio Playback**: Synthesis returns an audio element which we play:
```javascript
const audio = await this.kokoro(processedText, {
  voice: this.voiceOptions.voice,
  speed: this.voiceOptions.speed
});
audio.play();
```
## User Experience Flow
1. User sees page immediately with browser TTS enabled (fast startup)
2. Kokoro loads in background without blocking the interface
3. Once Kokoro is ready, TTS switches to higher quality neural TTS
4. User can manually switch between providers via the UI if desired
This design prioritizes performance and user experience, making the TTS system both flexible and resource-efficient.
# Important practices
- Always ignore the following error when debugging console output: `onpage-dialog.preload.js:121 Uncaught ReferenceError: browser is not defined`. It is produced by the installed ad blocker and has nothing to do with our project.
# Text-to-Speech Synchronization Architecture
The TTS system needs to be synchronized with text animations to create a cohesive user experience. This section outlines the requirements and implementation approach.
## Transition to Game
The overlay fades away as the first scheduled animation.
- This fade animation is handled by the animation scheduler module
- Only after successful fade-out does the game loop start
- Socket connection is established and begins receiving text
## Text Buffering & Sentence Processing
1. **Text Buffer Collection**: Incoming text from sockets is collected in a buffer.
- System can receive fragments of any size (single letters, words, sentences, paragraphs)
- All text is accumulated in the buffer regardless of fragment size
- Buffer handles partial/incomplete text gracefully
2. **Sentence Detection**: The buffer identifies complete sentences.
- When full sentences are detected, they are extracted from the buffer
- If multiple sentences arrive simultaneously, they are split and processed individually
- Remaining partial sentences stay in the buffer until completion
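
The buffering behavior described above can be sketched as follows. The sentence boundary rule (terminal punctuation, optional closing quotes, then whitespace) is an assumption for illustration; it will mis-split abbreviations such as "Dr. Smith", and the real `TextBuffer` may use a different rule. The `flush` method is likewise hypothetical.

```javascript
class TextBuffer {
  constructor() {
    this.pending = '';
  }

  // Appends a fragment of any size and returns every sentence completed so far.
  push(fragment) {
    this.pending += fragment;
    const sentences = [];
    // Assumed rule: ., ! or ? (plus optional closing quotes/brackets) followed by whitespace.
    const boundary = /[.!?]+["'”’)]*\s+/g;
    let start = 0;
    let match;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start); // keep the partial sentence
    return sentences;
  }

  // Returns whatever is left, e.g. when the server signals the end of a message.
  flush() {
    const rest = this.pending.trim();
    this.pending = '';
    return rest;
  }
}
```

Requiring whitespace after the punctuation means the final sentence of a stream stays buffered until more text arrives or `flush()` is called, which avoids splitting inside a fragment such as `3.` followed later by `14`.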
## Synchronized Playback
1. **TTS Generation Queue**: Complete sentences enter the TTS generation queue.
- Generation begins immediately if no other sentence is being processed
- Results are cached for immediate playback when needed
2. **Animation Timing**: Animation speed is synchronized with audio duration.
- The system calculates animation duration to match TTS audio length exactly
- Both animation and audio start simultaneously
- Animation completes at the same time as audio playback
3. **Playback Pipeline**: Continuous processing of sentences.
- As soon as one sentence completes playback, the next begins
- Next sentence generation starts during current sentence playback
- This creates a seamless reading experience
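
The "generate next while playing current" behavior can be expressed with plain Promises. In this sketch, `generate` and `play` are hypothetical async callbacks supplied by the caller (for example, a TTS handler and the animation player); neither name comes from the project.

```javascript
// Starts generation early and marks the promise as observed so a failure
// during playback is not reported as unhandled; it still surfaces at `await`.
function prefetch(generate, sentence) {
  const pending = Promise.resolve(generate(sentence));
  pending.catch(() => {});
  return pending;
}

// Plays sentences back to back, generating sentence i+1 while sentence i plays.
async function playAll(sentences, generate, play) {
  if (sentences.length === 0) return;
  let nextAudio = prefetch(generate, sentences[0]);
  for (let i = 0; i < sentences.length; i++) {
    const audio = await nextAudio;
    if (i + 1 < sentences.length) {
      nextAudio = prefetch(generate, sentences[i + 1]);
    }
    await play(sentences[i], audio);
  }
}
```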
## Fast-Forward & Control Flow
1. **Fast-Forward Behavior**: User can skip current sentence.
- Pressing the designated fast-forward key completes current animation immediately
- TTS audio is faded out and stopped
- System advances to next sentence
2. **Resource Management**: TTS generation is resource-conscious.
- Uses only CPU/GPU resources not needed for animation
- Generation process can be cancelled by fast-forward
- System prioritizes smooth animation over TTS preparation
3. **Loading States**: Animation waits for TTS when necessary.
- If next sentence TTS generation isn't ready when needed, animation pauses
- Fast-forward key can skip incomplete generation
- User is never blocked completely by TTS generation
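
Cancellation on fast-forward maps naturally onto the standard `AbortController`. The sketch below is an assumption about how this could be wired: `generateCancellable` is a hypothetical helper, and `synthesize` stands in for whichever TTS handler is active.

```javascript
// Wraps a synthesis call so the caller can abandon it when the user fast-forwards.
// `synthesize(sentence, signal)` is a hypothetical function returning a Promise.
function generateCancellable(sentence, synthesize) {
  const controller = new AbortController();
  const result = new Promise((resolve, reject) => {
    controller.signal.addEventListener(
      'abort',
      () => reject(new Error('cancelled')),
      { once: true },
    );
    // The signal is also passed down so the handler can stop its own work early.
    synthesize(sentence, controller.signal).then(resolve, reject);
  });
  return { result, cancel: () => controller.abort() };
}
```

A fast-forward key handler would call `cancel()` on the in-flight job and advance to the next sentence, so the user is never blocked by generation.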
## Persistent Configuration
1. **Options Storage**: The persistence-manager stores TTS settings.
- Speech on/off state is remembered
- Speed settings are preserved between sessions
- Voice preferences are stored
2. **Options UI**: Add an options button and modal dialog.
- Show additional options in a modal window
- Include volume sliders:
  - Master volume control
  - TTS volume control
  - Music volume control
  - Sound effects volume control
- Include manual TTS system selection
- All settings are persisted via the persistence-manager
This synchronized approach ensures that text animations and speech work together seamlessly, creating a more immersive storytelling experience while maintaining smooth performance.
# Text Output Pipeline Architecture
The text output pipeline manages the flow of text from server reception to visual display and audio playback, with a focus on performance and synchronization.
## Core Components
1. **Socket Client**: Receives raw text fragments from the server.
2. **TextBuffer**: Accumulates fragments and identifies complete sentences.
- Collects all incoming text regardless of fragment size
- Identifies and extracts complete sentences
- Maintains partial sentences until completion
3. **SentenceQueue**: Manages the preparation pipeline for sentences.
- Receives complete sentences from TextBuffer
- Orchestrates parallel processing of TTS generation and text layout
- Ensures sentences are fully prepared before playback
- Maintains a queue of sentences ready for playback
4. **TTS Generation System**: Prepares audio for sentences.
- Generates audio in the background without blocking UI
- Provides audio duration information for synchronization
- Can be cancelled for fast-forward operations
- Falls back to a duration estimated from character count when TTS is disabled
5. **Typography Processor**: Enhances text presentation quality.
- Applies smart typography (quotes, em-dashes, etc.)
- Handles hyphenation for line breaks
- Preserves special formatting
6. **ParagraphLayout**: Calculates optimal text presentation.
- Computes line breaks using Knuth-Plass algorithm
- Determines word positioning and timing
- Adjusts animation duration to match audio length
7. **AnimationPlayerQueue**: Manages the playback pipeline.
- Maintains a playlist of ready-to-play sentences
- Inserts DOM elements for prepared sentences
- Coordinates CSS-based animations
- Monitors animation completion
- Automatically advances to next sentence
## Process Flow
1. **Preparation Pipeline**:
- Socket client receives text and feeds it to TextBuffer
- TextBuffer identifies complete sentences
- SentenceQueue receives complete sentences
- TTS generation and layout processing happen in parallel
- When both TTS and layout are complete, sentence is marked "ready"
- Ready sentences are added to AnimationPlayerQueue
2. **Playback Pipeline**:
- AnimationPlayerQueue plays the first ready sentence
- DOM elements are inserted and CSS animations begin
- TTS audio plays simultaneously with animations
- AnimationPlayerQueue monitors complete animation duration
- When playback completes, the next ready sentence immediately begins
3. **Fast-Forward Handling**:
- Can interrupt at any stage of the pipeline
- Currently playing animations are immediately completed
- Currently playing audio is faded out and stopped
- Any in-progress sentence preparation is cancelled
- System advances to the next sentence in queue
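
The preparation pipeline above can be sketched with `Promise.all` for the parallel step and a promise chain to preserve arrival order. The `generateAudio`, `computeLayout`, `onReady`, and `onError` callbacks are hypothetical injection points, not the project's actual interfaces.

```javascript
// Runs TTS generation and layout in parallel; the sentence is "ready" when both finish.
async function prepareSentence(text, { generateAudio, computeLayout }) {
  const [audio, layout] = await Promise.all([generateAudio(text), computeLayout(text)]);
  return { text, audio, layout, status: 'ready' };
}

class SentenceQueue {
  constructor(workers, onReady, onError = console.error) {
    this.workers = workers;
    this.onReady = onReady;
    this.onError = onError;
    this.tail = Promise.resolve();
  }

  enqueue(text) {
    // Preparation starts immediately; settle to an outcome object so a failure
    // never leaves a rejected promise unobserved.
    const prepared = prepareSentence(text, this.workers).then(
      (sentence) => ({ sentence }),
      (error) => ({ error }),
    );
    // Chain on the tail so sentences are handed over in arrival order,
    // even if a later sentence finishes preparing first.
    this.tail = this.tail.then(() => prepared).then((outcome) => {
      if (outcome.error) this.onError(outcome.error);
      else this.onReady(outcome.sentence);
    });
    return this.tail;
  }
}
```

In this sketch `onReady` would typically add the sentence to the `AnimationPlayerQueue`.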
## Speed Synchronization
1. **Audio-Driven Timing**: Animation speed is determined by audio duration
- TTS audio length dictates animation duration
- Without TTS, duration is calculated from character count and speed setting
2. **Seamless Transitions**: Next sentence begins immediately after current completes
- No gap between sentence playbacks
- Preparation happens during playback of previous sentence
3. **Feedback Loop**: Animation system provides timing data back to preparation pipeline
- Helps optimize future sentence preparation
- Allows runtime adjustment of timing parameters
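
The audio-driven timing rule, including the character-count fallback, reduces to a small function. The base rate of 15 characters per second is an assumed placeholder, and `animationDurationMs` is a hypothetical name.

```javascript
// Assumed default: roughly 15 characters per second at speed 1.0.
const BASE_CHARS_PER_SECOND = 15;

// Returns the animation duration in milliseconds for one sentence.
function animationDurationMs(text, { audioDurationMs = null, speed = 1 } = {}) {
  // When TTS audio exists, its length dictates the animation duration.
  if (audioDurationMs !== null) return audioDurationMs;
  if (!(speed > 0)) throw new RangeError('speed must be positive');
  // Without TTS, estimate from character count and the speed setting.
  return Math.round((text.length / (BASE_CHARS_PER_SECOND * speed)) * 1000);
}
```

For example, a 30-character sentence without audio animates over 2000 ms at speed 1 and 1000 ms at speed 2.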
This architecture separates preparation from playback, creating a buffer of ready content that enables smooth presentation while handling the computational overhead of text processing and TTS generation in the background.
# Text Processing & Layout Architecture
The text processing and layout system transforms raw text input into visually appealing, typographically correct, and elegantly animated content through several specialized components.
## Component Interactions
### Core Components
1. **text-processor.js**: Enhances typography and applies hyphenation
- Entry point for text processing pipeline
- Manages SmartyPants for typographic enhancements
- Controls Hyphenopoly for language-aware hyphenation
- Serves as the central coordinator for text transformation
2. **smartypants.js**: Provides typographic punctuation conversion
- Transforms straight quotes to curly quotes
- Converts hyphens to em-dashes and en-dashes
- Handles ellipses and other typographic niceties
- Operates as a pure function with no dependencies
3. **paragraph-layout.js**: Manages paragraph structure and word metrics
- Breaks text into words and calculates their dimensions
- Manages paragraph-level styling and layout properties
- Prepares text for the line-breaking algorithm
- Connects text-processor output to the layout engine
4. **knuth-plass.js**: Implementation of the optimal line-breaking algorithm
- Calculates aesthetically pleasing line breaks
- Minimizes "raggedness" across paragraph lines
- Implements the core Knuth-Plass algorithm
- Uses linked-list.js for internal data structures
5. **linked-list.js**: Provides data structures for the line-breaking algorithm
- Implements doubly-linked list for efficient node insertion/removal
- Supports the complex data relationships in the Knuth-Plass algorithm
- Pure utility with no direct interaction with other components
6. **hyphenopoly.module.js**: Performs language-aware hyphenation
- Contains language-specific hyphenation patterns
- Provides functions to insert soft hyphens at valid breaking points
- Loaded dynamically when needed by text-processor.js
7. **layout-renderer.js**: Translates calculated layout into DOM elements
- Takes the output from paragraph-layout.js
- Generates DOM structure for the text display
- Creates CSS classes and styles for animations
- Prepares text for display and animation
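
To make the typographic step concrete, here is a deliberately small sketch of SmartyPants-style substitutions as a pure function. It is not the project's `smartypants.js`: the function name is hypothetical, the dash mapping (`--` to en dash, `---` to em dash) is one of several conventions, and nested quotes are not handled.

```javascript
// Pure function: converts plain ASCII punctuation to typographic equivalents.
function smartenPunctuation(text) {
  return text
    .replace(/---/g, '\u2014')               // three hyphens -> em dash
    .replace(/--/g, '\u2013')                // two hyphens -> en dash
    .replace(/\.\.\./g, '\u2026')            // three dots -> ellipsis
    .replace(/(^|[\s(\[{])"/g, '$1\u201C')   // double quote at start or after space/bracket opens
    .replace(/"/g, '\u201D')                 // every remaining double quote closes
    .replace(/(^|[\s(\[{])'/g, '$1\u2018')   // single quote at start or after space/bracket opens
    .replace(/'/g, '\u2019');                // remaining single quotes and apostrophes
}
```

The order of the replacements matters: the three-hyphen rule must run before the two-hyphen rule, and opening quotes must be resolved before the catch-all closing rules.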
## Process Flow
1. **Text Input → Typography Enhancement**
```
Raw Text → text-processor.js → smartypants.js → Enhanced Text
```
- Raw text enters the text-processor
- SmartyPants functions transform quotation marks, dashes, etc.
- Typography-enhanced text is produced
2. **Typography-Enhanced Text → Hyphenation**
```
Enhanced Text → text-processor.js → hyphenopoly.module.js → Hyphenated Text
```
- Enhanced text is passed to the hyphenation system
- Language-specific rules determine valid hyphenation points
- Soft hyphens are inserted at appropriate positions
3. **Hyphenated Text → Layout Calculation**
```
Hyphenated Text → paragraph-layout.js → knuth-plass.js → Optimized Layout
```
- Paragraph layout breaks text into words and calculates metrics
- Knuth-Plass algorithm calculates optimal line breaks
- linked-list.js provides the data structures for this process
- An optimized layout structure is produced
4. **Layout → Rendering**
```
Optimized Layout → layout-renderer.js → DOM Elements
```
- Layout renderer converts the abstract layout to concrete DOM
- CSS classes and styles are applied for animation
- Words are positioned according to the calculated layout
5. **Rendering → Animation**
```
DOM Elements → AnimationQueue → Visual Display
```
- The rendered DOM elements are passed to the animation system
- Words are animated according to timing and styling parameters
- Visual presentation occurs synchronized with audio if applicable
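
The idea behind optimal line breaking can be shown with a simplified dynamic program. This is not the full Knuth-Plass algorithm (no boxes, glue, penalties, or hyphenation) and not the project's `knuth-plass.js`; it only minimizes the sum of squared trailing space per line, and it measures width in characters instead of real font metrics.

```javascript
// Returns lines for `words` such that total squared slack is minimal.
// The last line is free, as is customary for ragged-right paragraphs.
function breakLines(words, maxWidth) {
  const n = words.length;
  const best = new Array(n + 1).fill(Infinity); // best[i]: minimal cost for words[i..]
  const nextBreak = new Array(n + 1).fill(n);   // nextBreak[i]: first word of the following line
  best[n] = 0;

  for (let i = n - 1; i >= 0; i--) {
    let width = -1; // cancels the space counted before the first word
    for (let j = i; j < n; j++) {
      width += words[j].length + 1;
      if (width > maxWidth && j > i) break; // an overlong single word still gets its own line
      const slack = Math.max(0, maxWidth - width);
      const lineCost = j === n - 1 ? 0 : slack * slack;
      if (lineCost + best[j + 1] < best[i]) {
        best[i] = lineCost + best[j + 1];
        nextBreak[i] = j + 1;
      }
    }
  }

  const lines = [];
  for (let i = 0; i < n; i = nextBreak[i]) {
    lines.push(words.slice(i, nextBreak[i]).join(' '));
  }
  return lines;
}
```

With the words `aaa bb cc ddddd` and a width of 6, a greedy breaker produces `aaa bb / cc / ddddd`, leaving a very short middle line, while this function chooses `aaa / bb cc / ddddd` because its total squared slack is lower (10 versus 16).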
## Implementation Dependencies
```
text-processor.js
├── smartypants.js
└── hyphenopoly.module.js
    └── [language pattern files]
paragraph-layout.js
├── knuth-plass.js
│   └── linked-list.js
└── [font metrics]
layout-renderer.js
└── [CSS styling]
```
## Integration Points
1. **Text Buffer → Text Processor**
- Text buffer passes complete sentences to the processing pipeline
- Text processor enhances typography and applies hyphenation
2. **Text Processor → Paragraph Layout**
- Enhanced text flows to paragraph layout for structure analysis
- Word metrics and paragraph properties are calculated
3. **Paragraph Layout → Layout Renderer**
- Optimized layout information is passed to the renderer
- Renderer creates DOM elements with appropriate styling
4. **Layout Renderer → Animation Queue**
- Rendered elements are scheduled for animation
- Animation timing is synchronized with TTS if enabled
This architecture ensures typographically beautiful text with optimal line breaks, proper hyphenation, and smooth animation, creating a professional reading experience.
# TTS Integration with Localization
## Architecture Overview
The Text-to-Speech (TTS) system has been refactored to seamlessly integrate with the localization module, ensuring a cohesive user experience across different languages. This integration follows these key architectural principles:
1. **Base TTS Handler Pattern**: All TTS handlers extend a common `TTSHandler` class that inherits from `BaseModule`, ensuring consistent interface and behavior.
2. **Dependency Injection**: TTS handlers access the localization and persistence modules through the dependency system rather than direct global references.
3. **Locale-Aware Voice Selection**: TTS handlers automatically select appropriate voices based on the current locale.
4. **Preference Persistence**: User preferences for TTS settings are stored and retrieved through the persistence manager.
5. **Optional Functionality**: TTS is treated as an optional feature that can be unavailable without breaking the application.
## Core Components
1. **TTSFactory**: Central coordinator for TTS functionality
- Manages initialization of all TTS handlers
- Implements fallback mechanisms when preferred TTS systems are unavailable
- Provides access to the active TTS handler
- Integrates with localization module for language-aware voice selection
- Reports TTS availability to the UI
2. **TTSHandler**: Abstract base class for all TTS handlers
- Defines common interface methods (speak, stop, getVoices, etc.)
- Provides shared utility functions for voice selection and preference handling
- Extends BaseModule for dependency management and event handling
3. **TTS Handlers**: Concrete implementations for different TTS approaches
- **BrowserTTSHandler**: Uses the Web Speech API
- **ApiTTSHandler**: Communicates with a remote TTS API
- **KokoroHandler**: Provides neural TTS via Kokoro.js
4. **OptionsUI**: User interface for TTS configuration
- Allows selection of TTS system (Browser, API, Kokoro)
- Provides voice selection based on available voices for current locale
- Includes controls for volume, rate, and pitch
- Persists user preferences via PersistenceManager
## Localization Integration
1. **Locale-Based Voice Selection**:
- Each TTS handler implements `setupVoiceFromPreferences()` to select voices based on:
  - User's explicitly saved voice preference
  - Current locale from the localization module
  - Fallback to language-matching voice if exact locale match not found
  - Default voice (typically English) as final fallback
2. **Voice Filtering**:
- TTS handlers filter available voices to prioritize those matching the current locale
- Voice lists in the UI are sorted to show locale-matching voices first
3. **Preference Persistence**:
- TTS settings (system, voice, volume, rate) are saved per-user
- Settings are automatically applied when the application loads
- Changes in the localization settings trigger voice re-selection
## Initialization Flow
1. **TTSFactory Initialization**:
```
TTSFactory.initialize()
├── Loads user preferences via PersistenceManager
├── Initializes all available TTS handlers
│   ├── KokoroHandler.initialize()
│   └── BrowserTTSHandler.initialize()
├── Selects active TTS handler based on preferences and availability
├── Sets up event listeners for locale changes
└── Dispatches TTS availability event
```
2. **TTS Handler Initialization**:
```
TTSHandler.initialize()
├── Loads system-specific resources
├── Retrieves available voices
├── Gets dependencies (localization, persistenceManager)
└── Sets up voice based on preferences and locale
```
3. **Voice Setup Process**:
```
setupVoiceFromPreferences()
├── Gets user's preferred voice from persistenceManager
├── If preferred voice exists and is available:
│   └── Use preferred voice
├── Otherwise:
│   ├── Get current locale from localization module
│   ├── Find voice matching current locale
│   ├── If no match, find voice matching language part
│   └── If still no match, use default voice
└── Update preference with selected voice
```
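
The voice setup process could be implemented along these lines. The voice shape `{ id, lang }` with BCP 47 language tags, the `selectVoice` name, and the option names are assumptions for this sketch; persisting the result back to the persistence manager is left to the caller.

```javascript
// Picks a voice using the fallback chain: saved preference, exact locale,
// same language, default language, then any voice at all.
function selectVoice(voices, { preferredVoiceId = null, locale, defaultLang = 'en' }) {
  const languageOf = (tag) => tag.toLowerCase().split('-')[0];

  const preferred = voices.find((voice) => voice.id === preferredVoiceId);
  if (preferred) return preferred;

  const wanted = locale.toLowerCase();
  const exact = voices.find((voice) => voice.lang.toLowerCase() === wanted);
  if (exact) return exact;

  const sameLanguage = voices.find((voice) => languageOf(voice.lang) === languageOf(wanted));
  if (sameLanguage) return sameLanguage;

  const fallback = voices.find((voice) => languageOf(voice.lang) === defaultLang);
  return fallback ?? voices[0] ?? null;
}
```

Returning `null` for an empty voice list lets the caller treat "no voice" as a normal outcome, in line with TTS being optional.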
## Event Handling
1. **Locale Change Events**:
- When user changes locale in the UI, the localization module emits a 'locale-changed' event
- TTSFactory listens for this event and triggers voice re-selection in the active TTS handler
2. **TTS Preference Events**:
- Changes to TTS settings in the options UI trigger preference updates
- These updates are persisted and immediately applied to the active TTS handler
3. **TTS Availability Events**:
- TTSFactory dispatches 'tts:availability' events to notify the UI about TTS availability
- UI Controller listens for these events and updates the speech toggle button accordingly
## Error Handling and Fallbacks
1. **TTS System Fallbacks**:
- If the preferred TTS system fails to initialize, TTSFactory falls back to the next available system
- Priority order: Kokoro > Browser > None (with None being acceptable)
- API TTS is not used as a fallback as it requires manual configuration
2. **Voice Selection Fallbacks**:
- If preferred voice is unavailable, fall back to locale-matching voice
- If no locale match, fall back to language match
- If no language match, fall back to default (typically English)
3. **TTS Unavailability Handling**:
- If no TTS handlers are available, the system continues to function without TTS
- The speech toggle button is disabled in the UI
- The application remains fully functional for text-only interaction
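
The system-level fallback can be sketched as a loop over candidates in priority order. This assumes each handler exposes an async `initialize()` that rejects when the system is unavailable; `selectTtsHandler`, the handler keys, and the returned shape are hypothetical.

```javascript
// `handlers` is a Map of name -> handler. The user's preferred system is tried
// first, then the remaining systems in priority order (Kokoro > Browser).
async function selectTtsHandler(handlers, preferred, priority = ['kokoro', 'browser']) {
  const order = [preferred, ...priority.filter((name) => name !== preferred)]
    .filter((name) => handlers.has(name));

  for (const name of order) {
    try {
      await handlers.get(name).initialize();
      return { name, handler: handlers.get(name) };
    } catch (error) {
      console.warn(`TTS system "${name}" unavailable: ${error.message}`);
    }
  }
  // "None" is an acceptable outcome: the UI disables the speech toggle.
  return { name: null, handler: null };
}
```

Because `api` is absent from the default priority list, it is only ever used when the user selects it explicitly, which matches the rule that API TTS is not a fallback.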
This architecture ensures that the TTS system seamlessly adapts to the user's language preferences while maintaining a consistent and intuitive user experience across different locales, even when TTS is unavailable.