Partition-aware MinHash LSH deduplication library for large-scale text data curation on Apache Spark.
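
For readers new to the technique, here is a minimal single-process sketch of MinHash signatures with LSH banding. It is not this library's API and does not use Spark. Every name in it (`shingles`, `minhash_signature`, `candidate_pairs`, `estimated_jaccard`) is hypothetical, and so are the parameters (5-character shingles, 64 hash functions, 16 bands). It only illustrates the general idea: documents whose signatures agree on every row of at least one band become candidate duplicates.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Character k-grams of the lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def _seeded_hash(seed, token):
    # 64-bit hash; the seed is passed as the BLAKE2b salt so that each
    # seed behaves like an independent hash function.
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_perm=64):
    """For each seeded hash function, keep the minimum hash over all tokens."""
    return tuple(
        min(_seeded_hash(seed, t) for t in tokens) for seed in range(num_perm)
    )


def candidate_pairs(signatures, bands=16):
    """Pairs of document ids whose signatures match on at least one full band.

    `signatures` maps a document id to its MinHash signature; the signature
    length is assumed to be divisible by `bands`.
    """
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            # The bucket key is the band index plus that band's slice.
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

To use it, build `{doc_id: minhash_signature(shingles(text))}` for a handful of documents, pass that mapping to `candidate_pairs`, and confirm each returned pair with `estimated_jaccard` or an exact comparison. In a distributed setting the `(band, slice)` bucket key is the natural value to group on, but how this library partitions its data is not described here.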