Web Crawler With Multi-Level Topic Clustering

Clustering And Crawling Of News Articles

Description

An open-source project for crawling news articles from different websites with the help of the CommonCrawl project. It also includes a multi-level clustering algorithm that uses K-Means and Latent Dirichlet Allocation (LDA) to group ~286,000 articles by topic.
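
To give a rough idea of what a multi-level clustering with K-Means and LDA can look like, here is a minimal sketch using scikit-learn. It is not the project's actual code. The ordering of the two levels (K-Means on TF-IDF vectors first, then LDA inside each coarse cluster), the function name `two_level_clusters`, the parameters `n_coarse` and `n_fine`, and the `SAMPLE` articles are all assumptions made for illustration.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy articles; the real dataset has ~286,000 entries.
SAMPLE = [
    "The team won the football match after a late goal",
    "The striker scored twice in the football cup final",
    "The tennis champion won the final set of the match",
    "Parliament passed the new tax law after a long debate",
    "The minister defended the budget in parliament",
    "Voters elected a new president in the national election",
]


def two_level_clusters(articles, n_coarse=2, n_fine=2, seed=0):
    """Return one (coarse_id, fine_id) pair per article (assumed design)."""
    # Level 1: K-Means on TF-IDF vectors yields coarse groups.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(articles)
    kmeans = KMeans(n_clusters=n_coarse, n_init=10, random_state=seed)
    coarse = kmeans.fit_predict(tfidf)

    # Collect the article indices that belong to each coarse cluster.
    members = defaultdict(list)
    for idx, label in enumerate(coarse):
        members[int(label)].append(idx)

    labels = [None] * len(articles)
    for label, indices in members.items():
        # Level 2: LDA on raw term counts, fitted per coarse cluster.
        counts = CountVectorizer(stop_words="english").fit_transform(
            [articles[i] for i in indices]
        )
        lda = LatentDirichletAllocation(n_components=n_fine, random_state=seed)
        doc_topics = lda.fit_transform(counts)  # rows: per-article topic mix
        for i, row in zip(indices, doc_topics):
            # Assign each article to its most probable subtopic.
            labels[i] = (label, int(row.argmax()))
    return labels
```

Calling `two_level_clusters(SAMPLE)` returns a list with one `(coarse_id, fine_id)` tuple per article. In this sketch the subtopic ids are local to their coarse cluster, so `(0, 1)` and `(1, 1)` denote different subtopics.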

The project includes the code, which is published on GitHub, and two datasets: one unclustered and one clustered.