Doubao releases speech recognition model 2.0, supporting multimodal visual recognition and recognition of 13 foreign languages.

05/12/2025

On December 5th, Volcano Engine officially released the Doubao speech recognition model 2.0, built on the Seed mixture-of-experts (MoE) large language model architecture. According to the official introduction, the 2.0 model has stronger reasoning capabilities and achieves precise recognition through a deep understanding of context, raising the overall keyword recall rate by 20%. It supports multimodal visual recognition, so it can understand by both listening and seeing: visual inputs such as a single image or multiple images improve the accuracy of the recognized text. It also supports recognition of 13 foreign languages, including Japanese, Korean, German, and French. In addition, the model has been upgraded to better handle complex cases such as proper nouns, personal names, place names, brand names, and easily confused homophones.