Highly Scalable Bespoke Website Scraping System
As part of one of my contracts, I was tasked with developing a bespoke website scraping system that let the internal team extract reviews and attributes from websites. The system was designed to be highly scalable, so the team could increase the number of Selenium workers assigned to scrape a given website. The system also included a dashboard where the team could view the status of each scrape and monitor overall project status.
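The idea of scaling workers for a given website can be illustrated with a minimal sketch. It is not the production code: it uses an in-process thread pool in place of the Kubernetes-hosted Selenium workers, and the names `scrape_pages`, `scrape_one` and `fake_scrape` are hypothetical, introduced only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def scrape_pages(
    urls: Iterable[str],
    scrape_one: Callable[[str], List[str]],
    workers: int = 4,
) -> Dict[str, List[str]]:
    """Run scrape_one over every URL using `workers` concurrent workers."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with url_list.
        results = list(pool.map(scrape_one, url_list))
    return dict(zip(url_list, results))


def fake_scrape(url: str) -> List[str]:
    # Hypothetical stand-in for a Selenium worker; returns a canned review.
    return [f"review from {url}"]


if __name__ == "__main__":
    pages = ["https://example.com/p/1", "https://example.com/p/2"]
    print(scrape_pages(pages, fake_scrape, workers=2))
```

Raising `workers` is the in-process analogue of adding more scraping workers for a large site; in a real deployment the stand-in function would be replaced by whatever drives the browser.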
In production, it scraped over 100k comments within a few hours, and the team could add workers as needed to scrape large websites quickly and efficiently. When it encountered bot-detection technologies, the system got past these protections by combining Selenium and PhantomJS with IP switching.
The system was able to overcome the following website protections:
-CAPTCHA
-Captcha2K (XEvil)
-ReCaptcha v1 and v2
-Multilayer Captcha
A desktop application was also developed using Electron for websites with significant bot protection, where a person could extract the reviews with a click of the mouse. The desktop application could also work as an extension of the web browser (in Chrome). It allowed users to extract reviews, manage their projects and monitor the status of their projects.
The team needed a way to manage the data once it was extracted, so I developed middleware that passed all extracted data into their internal pipeline.
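A middleware layer of this kind can be sketched roughly as follows. This is an illustration under assumptions, not the code that was delivered: the `Review` schema, the field names, and the `sink` callable standing in for the team's internal pipeline are all hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, Iterable


@dataclass
class Review:
    # Hypothetical shared schema; the real pipeline's fields are not known.
    source_url: str
    author: str
    text: str


def normalise(raw: Dict[str, Any]) -> Review:
    """Map a raw scraped record onto the assumed schema, trimming whitespace."""
    return Review(
        source_url=str(raw.get("url", "")),
        author=str(raw.get("author", "")).strip(),
        text=str(raw.get("text", "")).strip(),
    )


def forward(
    raw_records: Iterable[Dict[str, Any]],
    sink: Callable[[Dict[str, Any]], None],
) -> int:
    """Normalise each record, skip empty reviews, and hand the rest to sink.

    Returns the number of records forwarded.
    """
    sent = 0
    for raw in raw_records:
        review = normalise(raw)
        if not review.text:
            continue  # nothing useful to pass downstream
        sink(asdict(review))
        sent += 1
    return sent
```

Passing the pipeline in as a plain callable keeps the scraper unaware of where the data ends up; `sink` could wrap a queue producer, an HTTP client, or a database writer.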
The system was built using Laravel, VueJS, Python, Selenium and Electron. It was hosted on Kubernetes infrastructure, with Redis used as a cache layer.