Speech Recognition

Speech recognition is an AI technology that converts voice data captured by a microphone into text. It is used across a wide range of fields, including voice command operation, automatic meeting minute generation, pre-processing for multilingual translation, and voice recording in nursing care settings.

Algorithm Overview

The system combines audio signal processing with deep learning, converting speech to text through the following flow:

  1. Audio Input: Acquisition of audio stream from microphone
  2. Preprocessing: Noise removal, Voice Activity Detection (VAD)
  3. Feature Extraction: Extraction of MFCC / filter bank features
  4. Acoustic Model: Phoneme-level recognition (CTC / Attention-based)
  5. Language Model: Context-aware text conversion
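As an illustration of step 3, the sketch below computes log-mel filter bank features with NumPy. It is a minimal reference implementation, not the board's actual feature extractor; the frame length (25 ms), hop (10 ms), FFT size, and mel-band count are common defaults assumed here for a 16 kHz signal.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def log_mel_filterbank(signal, sample_rate=16000, frame_len=400,
                       hop=160, n_fft=512, n_mels=26):
    """Log-mel filter bank features for a mono float signal (illustrative sketch)."""
    # 1. Frame the signal: 25 ms windows with a 10 ms hop at 16 kHz.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # 2. Window each frame and compute the power spectrum.
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Build triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # 4. Apply the filters and take the log ("filter bank" features).
    return np.log(power @ fbank.T + 1e-10)
```

MFCCs would add one more step: a discrete cosine transform over these log-mel energies.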

Supported Languages

Support for Japanese, Chinese (Mandarin), and English is planned.

Edge AI Board (RV1126B) Execution Efficiency

* Performance evaluation is currently in progress; this page is preparatory-stage documentation.

Key Features

  • Edge processing: Local speech recognition without the cloud
  • Multilingual support: Japanese, Chinese, English
  • Low latency: Real-time processing via streaming recognition
  • Noise robustness: Stable recognition even under environmental noise
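Streaming recognition and noise robustness both typically rest on voice activity detection: frames below an energy threshold are skipped so the recognizer only processes speech. The following is a minimal energy-based VAD sketch, assuming 16 kHz audio and an illustrative threshold; production systems generally use model-based VAD instead.

```python
import numpy as np

def frame_energy_vad(signal, sample_rate=16000, frame_ms=30, threshold_db=-35.0):
    """Flag each frame as speech (True) or silence (False) by log RMS energy.

    Illustrative sketch: the 30 ms frame size and -35 dBFS threshold are
    assumptions, not values from the actual board implementation.
    """
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len : (i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2) + 1e-12)  # avoid log(0) on silence
        flags.append(20.0 * np.log10(rms) > threshold_db)
    return np.array(flags)
```

In a streaming loop, only the frames flagged True would be forwarded to feature extraction and the acoustic model, which keeps latency and compute low during silence.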

Use Cases

  • Hands-free device operation via voice commands
  • Voice input for nursing care records
  • Automatic generation of meeting and lecture minutes
  • Voice inspection records in factories
  • Pre-processing for multilingual speech translation
  • Call center voice-to-text conversion

Edge AI Board Implementation

Edge speech recognition leveraging the RV1126B NPU and DSP is under development. Because all processing runs locally with no network connection required, it achieves both privacy protection and low-latency response.
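The acoustic model stage above mentions CTC. To illustrate how a CTC model's per-frame outputs become a label sequence, here is a minimal greedy CTC decoder: take the argmax label in each frame, collapse consecutive repeats, and drop blanks. This is a generic sketch of the standard technique, independent of the RV1126B implementation; the blank label index of 0 is an assumption.

```python
def ctc_greedy_decode(logits, blank=0):
    """Greedy CTC decoding (illustrative sketch).

    logits: list of frames, each a list of per-label scores.
    Returns the decoded label sequence with repeats collapsed
    and the blank label (index 0, assumed) removed.
    """
    # Argmax label per frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    out, prev = [], None
    for label in best:
        # Emit only when the label changes and is not blank.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```

Note that a blank between two identical labels separates them, so "1, blank, 1" decodes to two 1s while "1, 1" collapses to one; this is what lets CTC represent repeated characters. On-device decoding would add a language model (step 5) on top of this.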