61 dependents
| Description | Downloads/month |
|---|---|
| Transforms complex documents like PDFs and Office docs into LLM-ready markdown/J... | 282K |
| Training library for Megatron-based models with bidirectional Hugging Face conve... | 29K |
| One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio T... | 16K |
| OpenCompass VLM Evaluation Kit for Eval-Scope | 15K |
| Official implementation of HPSv3: Towards Wide-Spectrum Human Preference Score (... | 6K |
| vLLM plugin for RBLN NPU | 4K |
| Modular media quality metrics toolkit. | 3K |
| AISAK, short for Artificially Intelligent Swiss Army Knife, is a general-purpose... | 3K |
| Photonamer: Autonomous photo file renaming tool using local Visual-Language Mode... | 2K |
| Cosmos-RL is a flexible and scalable Reinforcement Learning framework specialize... | 2K |
| Roboreason package | 2K |
| A PyTorch library for multi-modal image translation with diffusion bridges, GANs... | 2K |
| Agent-as-Annotators: Structured Distillation of Web Agent Capabilities | 1K |
| A wrapper for the Qwen2-VL model based for image-based inference to convert pdf ... | 1K |
| INF Tech's open-source MLLMs for SOTA visual-language understanding and advanced... | 1K |
| Local multimodal semantic search for large documents with complex diagrams (like... | 1K |
| An ML package for GStreamer | 1K |
| Cosmos-Predict2 is a collection of general-purpose world foundation models for P... | 1K |
| Dora Node for VLM | 984 |
| Computer Use OOTB | 853 |
| [ICLR 2026] EditScore: Unlocking Online RL for Image Editing via High-Fidelity R... | 730 |
| Evaluating Text-to-Visual Generation with Image-to-Text Generation. | 704 |
| llama-index multi_modal_llms HuggingFace integration by [Cihan Yalçın](https://w... | 681 |
| Dora Node for VLM | 600 |
| Uses a VLM to caption images from a dataset. | 581 |
| Hunyuan Video 1.5 | 531 |
| A Python module for efficient multi-model AI inference with memory management | 470 |
| AI agent swarm orchestrator for coding | 424 |
| Geospatial Vision-Language Model analysis for street-level imagery. Download Map... | 406 |
| Dora Node for RDT 1B | 397 |
| An integrated fine-tuning platform for lightweight vlmOCR models | 382 |
| Python runtime for Orign | 376 |
| siiRL: Shanghai Innovation Institute RL Framework for Advanced LLMs and Multi-Ag... | 348 |
| [ICML26 Spotlight] UniPercept: Towards Unified Perceptual-Level Image Understand... | 334 |
| This repo is a fork of the original VLM2Vec repo, modified for easy Pyserini int... | 326 |
| A Python package for OCR using Vision LLMs | 269 |
|  | 260 |
| LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 20... | 208 |
| Slimmed release mirror of UniTrust for AEN and TruthPrInt. | 192 |
| Benchmark utilities and environments for evaluating multimodal LLMs' proactivene... | 190 |
| Core package for OmAgent | 182 |
| Use a local LLM to convert PDF to Markdown | 175 |
| Vision-Language Model Interpretability Analysis - One Token at a Time | 168 |
| NVIDIA Cosmos Reason VLM provider for Strands Agents - physical AI reasoning, vi... | 157 |
| Add your description here | 146 |
| We are building a python package for building computer use capability that can a... | 122 |
| A simple no frills brute force unoptimized training package for VLMs | 121 |
| Helper utilities and constants for GroundNext models - Computer Use Agents for g... | 118 |
| A library for automating web tasks | 110 |
| OpenEMMA, a permissively licensed open source "reproduction" of Waymo’s EMMA mod... | 99 |