Deboiler is an open-source package that cleans HTML pages across an entire domain.
Heuristic text extraction from news articles