
Multimodal AI
New Approach to Document Image Parsing: Efficient Recognition and Structuring with Multimodal Models | Open Source Daily No.760
Dolphin: Multimodal Document Image Parsing

Repo: bytedance/Dolphin
Stars: 6.4k
License: MIT

Dolphin is a multimodal model for document image parsing, using heterogeneous anchor prompts to enable an “analyze first, then parse” workflow.

Key Features

* Two-stage processing:
  * Layout Analysis: Page-level layout detection that produces an element sequence in natural