Speech Recognition
Speech recognition is an AI algorithm that converts voice data captured by a microphone into text. It is used across a wide range of fields, including voice-command operation, automatic meeting-minute generation, pre-processing for multilingual translation, and voice record keeping in nursing-care settings.
Algorithm Overview
The algorithm combines audio signal processing with deep learning, converting speech to text through the following flow:
- Audio Input: Acquisition of audio stream from microphone
- Preprocessing: Noise removal, Voice Activity Detection (VAD)
- Feature Extraction: Extraction of MFCC / filter bank features
- Acoustic Model: Phoneme-level recognition (CTC / Attention-based)
- Language Model: Context-aware text conversion
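The preprocessing and feature-extraction stages above can be sketched in a few lines. This is a minimal, illustrative sketch only: it uses per-frame log energy as a stand-in for real MFCC / filter-bank features, and a fixed energy threshold as a stand-in for a trained VAD model. The function names and the 16 kHz / 25 ms / 10 ms framing parameters are assumptions, not part of the actual implementation.

```python
import math

def frame_signal(samples, frame_len=400, hop=160):
    """Split raw samples into overlapping frames (25 ms window, 10 ms hop at 16 kHz)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def frame_log_energy(frame):
    """Log energy of one frame; real systems would compute MFCC / filter-bank features here."""
    energy = sum(s * s for s in frame) / len(frame)
    return math.log(energy + 1e-10)  # epsilon avoids log(0) on silence

def simple_vad(samples, threshold=-6.0):
    """Keep only frames whose log energy exceeds the threshold (toy energy-based VAD)."""
    return [f for f in frame_signal(samples) if frame_log_energy(f) > threshold]
```

The surviving frames would then be turned into features and passed to the acoustic model; a trained VAD is far more robust than this energy threshold under real noise.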
Supported Languages
Support for Japanese, Chinese (Mandarin), and English is planned.
Edge AI Board (RV1126B) Execution Efficiency
*Performance evaluation is currently in progress. This page is preparatory-stage documentation.
Key Features
- Edge processing: Local speech recognition without the cloud
- Multilingual support: Japanese, Chinese, English
- Low latency: Real-time processing via streaming recognition
- Noise robustness: Stable recognition even under environmental noise
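The low-latency streaming behavior listed above can be illustrated with a chunked processing loop: audio is consumed in small fixed-size chunks, and partial results are emitted as soon as each chunk is recognized rather than waiting for the full utterance. This is a hedged sketch, not the product's API; `recognize_chunk` is a hypothetical stand-in for the on-device model, and the 100 ms chunk size is an assumption.

```python
from typing import Callable, Iterator, List

CHUNK_SAMPLES = 1600  # 100 ms of 16 kHz audio per chunk (assumed framing)

def stream_chunks(samples: List[float], chunk: int = CHUNK_SAMPLES) -> Iterator[List[float]]:
    """Yield fixed-size audio chunks, as they would arrive from a live microphone."""
    for start in range(0, len(samples), chunk):
        yield samples[start:start + chunk]

def streaming_recognize(samples: List[float],
                        recognize_chunk: Callable[[List[float]], str]) -> List[str]:
    """Feed chunks to a recognizer callback and collect partial results immediately.

    `recognize_chunk` is a hypothetical placeholder for the real acoustic +
    language model; a production system would also carry decoder state across chunks.
    """
    partials = []
    for chunk in stream_chunks(samples):
        text = recognize_chunk(chunk)
        if text:
            partials.append(text)  # emit partial hypothesis without waiting for end of speech
    return partials
```

Because each chunk is processed independently of the remaining audio, end-to-end latency is bounded by roughly one chunk length plus inference time.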
Use Cases
- Hands-free device operation via voice commands
- Voice input for nursing care records
- Automatic generation of meeting and lecture minutes
- Voice inspection records in factories
- Pre-processing for multilingual speech translation
- Call center voice-to-text conversion
Edge AI Board Implementation
Edge speech recognition leveraging the RV1126B NPU and DSP is under development. Because processing is performed locally, with no network connection required, it achieves both privacy protection and low-latency response.
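A local-only recognition loop of the kind described above might be structured as follows. This is a generic sketch under stated assumptions, not the RV1126B implementation: `audio_source` stands in for the microphone capture driver, `infer` for the NPU-accelerated model call, and `output` for whatever consumes the text. The point of the structure is that audio never leaves the device.

```python
import queue
import threading

def run_local_asr(audio_source, infer, output):
    """Capture and recognize entirely on-device; no network I/O anywhere in the loop.

    audio_source: iterable yielding audio chunks (hypothetical capture driver)
    infer:        callable mapping a chunk to text (hypothetical NPU model call)
    output:       callable receiving each recognized text fragment
    """
    q = queue.Queue()

    def capture():
        # Producer thread: push chunks from the audio source into the queue.
        for chunk in audio_source:
            q.put(chunk)
        q.put(None)  # sentinel marks end of stream

    t = threading.Thread(target=capture)
    t.start()
    while True:
        chunk = q.get()
        if chunk is None:
            break
        text = infer(chunk)
        if text:
            output(text)
    t.join()
```

Splitting capture and inference across threads keeps the microphone from dropping samples while the model is busy, which matters on a constrained edge SoC.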