Description
An open-source project for crawling news articles from different websites with the help of the CommonCrawl project. It also includes a multi-level clustering algorithm that uses K-Means and Latent Dirichlet Allocation (LDA) to sort ~286,000 articles into their relevant topics.
The project comprises the code, published on GitHub, and two datasets: one unclustered and one clustered.
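
To give a rough idea of what a multi-level clustering of this kind can look like, here is a minimal sketch using scikit-learn. It is not the project's actual code: the order of the two levels (K-Means first, then LDA inside each K-Means cluster), the function name cluster_articles, and all parameter values are assumptions made for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def cluster_articles(texts, n_clusters=2, n_topics=2, seed=0):
    """Return one (cluster_id, topic_id) pair per article (hypothetical helper)."""
    # Level 1 (assumed): K-Means on TF-IDF vectors yields coarse groups.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    coarse = kmeans.fit_predict(tfidf)

    labels = [None] * len(texts)
    for cluster_id in sorted(set(coarse)):
        members = [i for i, c in enumerate(coarse) if c == cluster_id]
        # Level 2 (assumed): LDA on raw term counts within each coarse group.
        counts = CountVectorizer(stop_words="english").fit_transform(
            [texts[i] for i in members]
        )
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
        # Assign each article to its most probable topic.
        topic_ids = lda.fit_transform(counts).argmax(axis=1)
        for i, topic_id in zip(members, topic_ids):
            labels[i] = (int(cluster_id), int(topic_id))
    return labels
```

Calling cluster_articles(list_of_article_texts) returns a list of (cluster_id, topic_id) pairs, one per article. On a corpus of several hundred thousand articles, the number of clusters and topics would need to be far larger and tuned to the data; the defaults here are only suitable for a toy example.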