PyPI Stats

Information Retrieval Python Packages

Python packages with the GitHub topic information-retrieval. Sorted by relevance, with stars and monthly downloads.
dorianbrown
rank-bm25

A Collection of BM25 Algorithms in Python

6.2M 1K 105
Unstructured-IO
unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

5.3M 15K 1K
RaRe-Technologies
gensim

Topic Modelling for Humans

5.1M 16K 4K
ashvardanian
simsimd

SIMD-accelerated distances, dot products, matrix ops, geospatial & geometric kernels for 16 numeric types — from 6-bit floats to 64-bit complex — across x86, Arm, RISC-V, and WASM, with bindings for Python, Rust, C, C++, Swift, JS, and Go 📐

4.2M 2K 116
ashvardanian
stringzilla

Up to 100x faster strings for C, C++, CUDA, Python, Rust, Swift, JS, & Go, leveraging NEON, AVX2, AVX-512, SVE, GPGPU, & SWAR to accelerate search, hashing, sorting, edit distances, sketches, and memory ops 🦖

3.1M 3K 124
jaidedai
easyocr

Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

2.8M 29K 4K
embeddings-benchmark
mteb

MTEB: Massive Text Embedding Benchmark

2.7M 3K 608
xhluca
bm25s

Fast BM25 search in Python, powered by NumPy and Numba

1.4M 2K 99
deepset-ai
haystack-ai

Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems.

776K 25K 3K
allenai
ir-datasets

Provides a common interface to many IR ranking datasets.

584K 389 52
FlagOpen
flagembedding

Retrieval and Retrieval-augmented LLMs

447K 12K 870
HKUNLP
instructorembedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

423K 2K 157
ashvardanian
numkong

SIMD-accelerated distances, dot products, matrix ops, geospatial & geometric kernels for 16 numeric types — from 6-bit floats to 64-bit complex — across x86, Arm, RISC-V, and WASM, with bindings for Python, Rust, C, C++, Swift, JS, and Go 📐

400K 2K 116
rapidsai
pylibraft-cu12

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.

227K 1K 231
rapidsai
libraft-cu12

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.

222K 1K 231
oeken
needle-python

Needle simplifies building RAG pipelines.

213K 45 2
illuin-tech
colpali-engine

The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.

164K 3K 250
rapidsai
raft-dask-cu12

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.

121K 1K 231
lightonai
fast-plaid

High-Performance Engine for Multi-Vector Search

112K 248 21
AmenRa
ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

90K 674 31
tensorflow
tensorflow-ranking

Learning to Rank in TensorFlow

72K 3K 478
lightonai
pylate

Late Interaction Models Training & Retrieval

72K 798 79
deepset-ai
farm-haystack

Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems.

67K 25K 3K
castorini
pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.

57K 2K 512
Data from PyPI, GitHub, ClickHouse, and BigQuery