Speech Recognition

Speech recognition is an AI technology that converts voice data captured by a microphone into text. It is used across a wide range of fields, including voice command operation, automatic meeting minute generation, pre-processing for multilingual translation, and voice recording in nursing care settings.

Algorithm Overview

The system combines audio signal processing with deep learning, converting speech to text through the following flow:

  1. Audio Input: Acquisition of audio stream from microphone
  2. Preprocessing: Noise removal, Voice Activity Detection (VAD)
  3. Feature Extraction: Extraction of MFCC / filter bank features
  4. Acoustic Model: Phoneme-level recognition (CTC / Attention-based)
  5. Language Model: Context-aware text conversion
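As an illustration of step 3, the sketch below computes log-mel filter bank features with NumPy. It is a minimal reference implementation, not the board's actual feature extractor; the frame length (25 ms), hop (10 ms), FFT size, and mel-band count are common defaults assumed here for a 16 kHz signal.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def log_mel_filterbank(signal, sample_rate=16000, frame_len=400,
                       hop=160, n_fft=512, n_mels=26):
    """Log-mel filter bank features for a mono float signal (illustrative sketch)."""
    # 1. Frame the signal: 25 ms windows with a 10 ms hop at 16 kHz.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # 2. Window each frame and compute the power spectrum.
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Build triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # 4. Apply the filters and take the log ("filter bank" features).
    return np.log(power @ fbank.T + 1e-10)
```

MFCCs would add one more step: a discrete cosine transform over these log-mel energies.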

Supported Languages

Support for Japanese, Chinese (Mandarin), and English is planned.

Edge AI Board (RV1126B) Execution Efficiency

* Performance evaluation is currently in progress; this page is preparatory-stage documentation.

Key Features

  • Edge processing: Local speech recognition without the cloud
  • Multilingual support: Japanese, Chinese, English
  • Low latency: Real-time processing via streaming recognition
  • Noise robustness: Stable recognition even under environmental noise
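Streaming recognition and noise robustness both typically rest on voice activity detection: frames below an energy threshold are skipped so the recognizer only processes speech. The following is a minimal energy-based VAD sketch, assuming 16 kHz audio and an illustrative threshold; production systems generally use model-based VAD instead.

```python
import numpy as np

def frame_energy_vad(signal, sample_rate=16000, frame_ms=30, threshold_db=-35.0):
    """Flag each frame as speech (True) or silence (False) by log RMS energy.

    Illustrative sketch: the 30 ms frame size and -35 dBFS threshold are
    assumptions, not values from the actual board implementation.
    """
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len : (i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2) + 1e-12)  # avoid log(0) on silence
        flags.append(20.0 * np.log10(rms) > threshold_db)
    return np.array(flags)
```

In a streaming loop, only the frames flagged True would be forwarded to feature extraction and the acoustic model, which keeps latency and compute low during silence.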

Use Cases

  • Hands-free device operation via voice commands
  • Voice input for nursing care records
  • Automatic generation of meeting and lecture minutes
  • Voice inspection records in factories
  • Pre-processing for multilingual speech translation
  • Call center voice-to-text conversion

Edge AI Board Implementation

Edge speech recognition leveraging the RV1126B NPU and DSP is under development. Because all processing runs locally with no network connection required, it achieves both privacy protection and low-latency response.
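The acoustic model stage above mentions CTC. To illustrate how a CTC model's per-frame outputs become a label sequence, here is a minimal greedy CTC decoder: take the argmax label in each frame, collapse consecutive repeats, and drop blanks. This is a generic sketch of the standard technique, independent of the RV1126B implementation; the blank label index of 0 is an assumption.

```python
def ctc_greedy_decode(logits, blank=0):
    """Greedy CTC decoding (illustrative sketch).

    logits: list of frames, each a list of per-label scores.
    Returns the decoded label sequence with repeats collapsed
    and the blank label (index 0, assumed) removed.
    """
    # Argmax label per frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    out, prev = [], None
    for label in best:
        # Emit only when the label changes and is not blank.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```

Note that a blank between two identical labels separates them, so "1, blank, 1" decodes to two 1s while "1, 1" collapses to one; this is what lets CTC represent repeated characters. On-device decoding would add a language model (step 5) on top of this.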