Vision Transformer Python Packages

mmdet

OpenMMLab Detection Toolbox and Benchmark

436K 33K 10K

mlx-vlm

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.

348K 5K 524

mmcls

OpenMMLab Pre-training Toolbox and Benchmark

50K 4K 1K

mmpretrain

OpenMMLab Pre-training Toolbox and Benchmark

21K 4K 1K

pix2tex

pix2tex: Using a ViT to convert images of equations into LaTeX code.

11K 16K 1K

thepipe-api

Get clean data from tricky documents, powered by vision-language models ⚡

3K 2K 99

mambavision

[CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone

2K 2K 139

towhee

Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

2K 3K 261

fastervit

FasterViT: Fast Vision Transformers with Hierarchical Attention

2K 914 69

garmentiq

Free & Open Source. Precise and flexible garment measurements from images - no tape measures, no delays, just fashion-forward automation.

2K 20 4

pai-easycv

An all-in-one toolkit for computer vision

1K 2K 225

attention-and-transformers

Transformers goes brrr... Attention and Transformers from scratch in TensorFlow. Currently contains Vision Transformers, MobileViT-v1, MobileViT-v2, and MobileViT-v3.

1K 14 2

clipq

A simple implementation of CLIP that splits an image into quadrants and then gets the embeddings for each quadrant

1K 7 1

towhee-models

Towhee is a framework that helps you encode your unstructured data into embeddings.

1K 3K 261

vision-transformers

Vision Transformers for image classification, image segmentation, and object detection.

1K 67 9

efficientvit-gml

open-set object detector

996 3K 240

conformal-clip

Few-shot CLIP classification with conformal prediction, probability calibration, and reliability metrics.

822 0 0

tfimm

TensorFlow port of PyTorch Image Models (timm) - image models with pretrained weights.

813 291 25

image-classification-jax

Image classification in JAX with ViT and ResNet on CIFAR-10, CIFAR-100, Imagenette, and ImageNet

667 3 0

deepvision-toolkit

PyTorch and TensorFlow/Keras image models with automatic weight conversions and equal API/implementations - Vision Transformer (ViT), ResNetV2, EfficientNetV2, NeRF, SegFormer, MixTransformer, (planned...) DeepLabV3+, ConvNeXtV2, YOLO, etc.

604 42 7

mmdet-taeuk4958

OpenMMLab Detection Toolbox and Benchmark

578 33K 10K

seq2seqsharp

Seq2SeqSharp is a fast and flexible tensor-based deep neural network framework written in .NET (C#). Its features include automatic differentiation, several network types (Transformer, LSTM, BiLSTM, and so on), multi-GPU support, cross-platform operation (Windows, Linux, macOS), and multimodal models for text and images.

480 211 43

clipcap

Using pretrained encoder and language models to generate captions from multimedia inputs.

479 100 14

vitaminp

VitaminP: a vision transformer-assisted multimodal integration network for pathology cell segmentation

473 8 1