Liquid AI released LFM2-VL-3B, a 3B-parameter vision-language model for image-text-to-text tasks. It extends the LFM2-VL family beyond the 450M and 1.6B variants. The model targets higher accuracy while preserving the speed profile of the LFM2 architecture. It is available on LEAP and Hugging Face under the LFM Open License v1.0.
Model overview and interface
LFM2-VL-3B accepts interleaved image and text inputs and produces text outputs. The model exposes a ChatML-like template. The processor inserts an <image> sentinel that is replaced with encoded image tokens at runtime. The default text context length is 32,768 tokens. These details help developers reproduce evaluations and integrate the model with existing multimodal pipelines.
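As a minimal sketch of how that interface fits together, the snippet below renders the chat template as text to show where the image sentinel lands. It assumes the Hugging Face repo id LiquidAI/LFM2-VL-3B and the standard transformers processor API:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("LiquidAI/LFM2-VL-3B")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # the processor swaps this for encoded image tokens
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]

# Render the prompt as text to inspect the ChatML-like structure and sentinel.
prompt = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
print(prompt)
```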
https://www.liquid.ai/blog/lfm2-vl-3b-a-new-efficient-vision-language-for-the-edge
Architecture
The stack pairs a language tower with a shape-aware vision tower and a projector. The language tower is LFM2-2.6B, a hybrid convolution-plus-attention backbone. The vision tower is SigLIP2 NaFlex at 400M parameters; it preserves native aspect ratios and avoids distortion. The connector is a 2-layer MLP with pixel unshuffle; it compresses image tokens before fusion with the language space. This design lets users cap vision token budgets without retraining the model, as the sketch below illustrates.
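For intuition, here is an illustrative PyTorch sketch of a pixel-unshuffle connector in this style. The layer sizes, activation, and class name are assumptions for illustration, not Liquid AI's published implementation:

```python
import torch
from torch import nn

class Lfm2VlStyleConnector(nn.Module):
    """Illustrative sketch (not Liquid AI's actual code): pixel unshuffle
    reduces the image-token count 4x before a 2-layer MLP projects vision
    features into the language model's embedding space."""

    def __init__(self, vision_dim: int, text_dim: int, downscale: int = 2):
        super().__init__()
        # Merges each downscale x downscale group of patch features into one token.
        self.unshuffle = nn.PixelUnshuffle(downscale)
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * downscale**2, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, feats: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # feats: (batch, h*w, vision_dim) patch features from the vision tower;
        # h and w must be divisible by the downscale factor.
        b, _, d = feats.shape
        grid = feats.transpose(1, 2).reshape(b, d, h, w)  # to (B, C, H, W)
        grid = self.unshuffle(grid)                       # (B, C*ds^2, H/ds, W/ds)
        tokens = grid.flatten(2).transpose(1, 2)          # (B, N/ds^2, C*ds^2)
        return self.mlp(tokens)                           # project into text dim

# Example: a 512x512 tile at 16px patches gives a 32x32 grid (1,024 patches),
# compressed to 256 tokens. Feature dims here are placeholders.
connector = Lfm2VlStyleConnector(vision_dim=1024, text_dim=2048)
feats = torch.randn(1, 32 * 32, 1024)
print(connector(feats, 32, 32).shape)  # torch.Size([1, 256, 2048])
```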
The encoder processes native resolutions up to 512×512. Larger inputs are split into non-overlapping 512×512 patches, and a thumbnail pathway provides global context during tiling. The token mapping is documented with concrete examples: a 256×384 image maps to 96 tokens, and a 1000×3000 image maps to 1,020 tokens. The model card exposes user controls for the minimum and maximum image tokens and the tiling switch. These controls tune speed and quality at inference time.
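As a back-of-the-envelope check, the documented 256×384 → 96 figure is consistent with a 16-pixel patch grid compressed 4× by the connector. The constants below are assumptions for illustration; the tiled 1000×3000 → 1,020 case additionally depends on per-tile budgets and the thumbnail pathway, which this sketch does not model:

```python
import math

PATCH = 16            # assumed ViT patch size for the SigLIP2 NaFlex encoder
UNSHUFFLE_FACTOR = 4  # a 2x2 pixel unshuffle merges 4 patch features into 1 token

def image_tokens_native(width: int, height: int) -> int:
    # Token count for an image encoded at native resolution (up to 512x512).
    patches = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    return patches // UNSHUFFLE_FACTOR

print(image_tokens_native(384, 256))  # -> 96, matching the documented example
```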
Inference settings
The Hugging Face model card provides recommended parameters. Text generation uses temperature 0.1, min-p 0.15, and a repetition penalty of 1.05. Vision settings use a minimum of 64 image tokens, a maximum of 256 image tokens, and image splitting enabled. The processor applies the chat template and the image sentinel automatically. The example uses AutoModelForImageTextToText and AutoProcessor with bfloat16 precision.
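An end-to-end sketch of those settings follows. The repo id, image URL, and exact processor keyword names are assumptions based on the controls the model card describes, not a verbatim copy of it:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.image_utils import load_image

model_id = "LiquidAI/LFM2-VL-3B"  # assumed repo id
processor = AutoProcessor.from_pretrained(
    model_id,
    min_image_tokens=64,      # vision token floor (assumed kwarg name)
    max_image_tokens=256,     # vision token cap (assumed kwarg name)
    do_image_splitting=True,  # tile large images into 512x512 patches
)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = load_image("https://example.com/sample.jpg")  # placeholder URL
conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]},
]

inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.1,          # recommended decoding settings from the card
    min_p=0.15,
    repetition_penalty=1.05,
)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```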
How is it trained?
Liquid AI describes a staged approach. The team performs joint mid-training that adjusts the text-to-image ratio over time. The model then undergoes supervised fine-tuning focused on image understanding. The data sources are large-scale open datasets plus in-house synthetic vision data for task coverage.
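Liquid AI does not publish the schedule; purely as an illustration of what "adjusting the text-to-image ratio over time" could look like, here is a hypothetical linear ramp:

```python
def text_image_ratio(step: int, total_steps: int,
                     start: float = 0.9, end: float = 0.3) -> float:
    # Hypothetical linear ramp: mostly text early to protect language skills,
    # shifting toward image-heavy batches later. The endpoints are illustrative
    # only; Liquid AI's actual schedule is not public.
    t = step / max(total_steps, 1)
    return start + (end - start) * t

print(text_image_ratio(5_000, 10_000))  # -> 0.6 text share at the midpoint
```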
Benchmarks
The research team reports competitive results among lightweight open VLMs. On MM-IFEval the model reaches 51.83. On RealWorldQA it reaches 71.37. On MMBench dev en it reaches 79.81. The POPE score is 89.01. The table notes that scores for other systems were computed with VLMEvalKit. The table excludes Qwen3-VL-2B because that system was released one day earlier.
The language capability stays close to the LFM2-2.6B backbone. The research team cites 30% on GPQA and 63% on MMLU. This matters when perception tasks include knowledge queries. The team also reports expanded multilingual visual understanding across English, Japanese, French, Spanish, German, Italian, Portuguese, Arabic, Chinese, and Korean.
Why should edge users care?
The architecture keeps compute and memory within small device budgets. Image tokens are compressible and user-constrained, so throughput is predictable. The SigLIP2 400M NaFlex encoder preserves aspect ratios, which helps fine-grained perception. The projector reduces tokens at the connector, which improves tokens per second. The team has also published a GGUF build for on-device runtimes. These properties are useful for robotics, mobile, and industrial clients that need local processing and strict data boundaries.
Key Takeaways
Compact multimodal stack: the 3B-parameter LFM2-VL-3B pairs an LFM2-2.6B language tower with a 400M SigLIP2 NaFlex vision encoder and a 2-layer MLP projector for image-token fusion. NaFlex preserves native aspect ratios.
Resolution handling and token budgets: images run natively up to 512×512; larger inputs tile into non-overlapping 512×512 patches with a thumbnail pathway for global context. Documented token mappings include 256×384 → 96 tokens and 1000×3000 → 1,020 tokens.
Inference interface: ChatML-like prompting with an <image> sentinel, a default text context of 32,768 tokens, recommended decoding settings, and processor-level controls for image splitting enable reproducible evaluation and easy integration into multimodal pipelines.
Measured performance: reported results include MM-IFEval 51.83, RealWorldQA 71.37, MMBench-dev-en 79.81, and POPE 89.01. Language-only signals from the backbone are about 30% GPQA and 63% MMLU, useful for mixed perception-plus-knowledge workloads.
LFM2-VL-3B is a practical step for edge multimodal workloads. The 3B stack pairs LFM2-2.6B with a 400M SigLIP2 NaFlex encoder and an efficient projector, which lowers image token counts for predictable latency. Native-resolution processing with 512×512 tiling and token caps gives deterministic budgets. Reported scores on MM-IFEval, RealWorldQA, MMBench, and POPE are competitive for this size. Open weights, a GGUF build, and LEAP access reduce integration friction. Overall, this is an edge-ready VLM release with clear controls and transparent benchmarks.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

