Valley3 is an omni multimodal large language model (MLLM) aimed at global e-commerce, with unified understanding and reasoning over text, images, video, and audio. Its distinguishing feature is native multilingual audio support for audio-visual tasks in short-video commerce scenarios, which could matter for teams building multimodal product search and assistant experiences.
arXiv:2605.01278v1 (announce type: new)
Abstract: In this work, we present Valley3, an omni multimodal large language model (MLLM) developed for diverse global e-commerce tasks, with unified understanding and reasoning capabilities across text, images, video, and audio. A key feature of Valley3 is its native multilingual audio capability for e-commerce, developed by extending vision-language models to better support crucial audio-visual tasks, particularly in short-video scenarios. To achieve…