Scalable Data Preprocessing Tool for Training Large Language Models
Scalable data preprocessing and curation toolkit for LLMs