Vision Transformer Python Packages

mmdet

OpenMMLab Detection Toolbox and Benchmark

430K 33K 10K

mlx-vlm

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.

349K 5K 506

mmcls

OpenMMLab Pre-training Toolbox and Benchmark

54K 4K 1K

mmpretrain

OpenMMLab Pre-training Toolbox and Benchmark

22K 4K 1K

pix2tex

pix2tex: Using a ViT to convert images of equations into LaTeX code.

11K 16K 1K

thepipe-api

Get clean data from tricky documents, powered by vision-language models ⚡

3K 2K 99

towhee

Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

2K 3K 261

mambavision

[CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone

2K 2K 139

fastervit

FasterViT: Fast Vision Transformers with Hierarchical Attention

2K 914 69

garmentiq

Free & Open Source. Precise and flexible garment measurements from images - no tape measures, no delays, just fashion-forward automation.

1K 20 4

pai-easycv

An all-in-one toolkit for computer vision

1K 2K 225

attention-and-transformers

Transformers goes brrr... Attention and Transformers from scratch in TensorFlow. Currently contains Vision transformers, MobileViT-v1, MobileViT-v2, MobileViT-v3

1K 14 2

clipq

A simple implementation of CLIP that splits an image into quadrants and then gets the embeddings for each quadrant

1K 7 1

towhee-models

Towhee is a framework that helps you encode your unstructured data into embeddings.

975 3K 261

efficientvit-gml

Open-set object detector

974 3K 240

vision-transformers

Vision Transformers for image classification, image segmentation, and object detection.

825 67 9

tfimm

TensorFlow port of PyTorch Image Models (timm) - image models with pretrained weights.

766 291 25

conformal-clip

Few-shot CLIP classification with conformal prediction, probability calibration, and reliability metrics.

736 0 0

haloblocks

Python library designed to make model experimentation seamless and fast. The goal was simple: treat every component (attention heads, MLPs, MoE layers) as a plug-and-play block so you can focus on the architecture, not the boilerplate.

572 5 0

image-classification-jax

Image classification in JAX with ViT, resnet, cifar10, cifar100, imagenette, and imagenet

555 3 0

deepvision-toolkit

PyTorch and TensorFlow/Keras image models with automatic weight conversions and equal API/implementations - Vision Transformer (ViT), ResNetV2, EfficientNetV2, NeRF, SegFormer, MixTransformer, (planned...) DeepLabV3+, ConvNeXtV2, YOLO, etc.

523 42 7

seq2seqsharp

Seq2SeqSharp is a fast and flexible tensor-based deep neural network framework written in .NET (C#). Its features include automatic differentiation, several network types (Transformer, LSTM, BiLSTM, and so on), multi-GPU support, cross-platform operation (Windows, Linux, macOS), and multimodal models for text and images, among others.

488 211 43

mmdet-taeuk4958

OpenMMLab Detection Toolbox and Benchmark

455 33K 10K

clipcap

Using pretrained encoder and language models to generate captions from multimedia inputs.

447 100 14
