Highly Scalable Bespoke Website Scraping System

As part of one of my contracts, I was tasked with developing a bespoke website scraping system that allowed the internal team to extract reviews and attributes from websites. The system was designed to be highly scalable, so the team could increase the number of Selenium workers assigned to a given website. It also included a dashboard where the team could view the status of each scrape and monitor the overall project status.

In production, it scraped over 100k comments in just a few hours, and allowed the team to scale up as needed to scrape large websites quickly and efficiently. When faced with bot detection technologies, the system overcame these protections by using a combination of Selenium and PhantomJS to scrape the website, combined with IP switching.

The system was able to overcome the following website protections:

- CAPTCHA
- Captcha2K (XEvil)
- reCAPTCHA v1 and v2
- Multilayer Captcha

A desktop application was also developed using Electron so that, for websites with significant bot protection, people could extract the reviews manually with the click of a mouse. The desktop application could also work as an extension of the web browser (in Chrome). It allowed users to extract reviews, manage their projects and monitor the status of each project.

The team needed a way of managing the data once it was extracted, so I developed middleware that passed all extracted data into their internal pipeline.
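To illustrate the idea, here is a minimal Python sketch of what such a middleware step could look like. It is not the code from the project: the `Review` fields, the `normalise` and `forward` names, and the `sink` callable standing in for the team's internal pipeline are all assumptions made for the example.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Iterable, Optional


@dataclass
class Review:
    """Hypothetical shape of one cleaned review record."""
    source_url: str
    author: str
    rating: Optional[float]
    text: str


def normalise(raw: dict) -> Optional[Review]:
    """Clean one raw scraped record; return None if it has no review text."""
    text = (raw.get("text") or "").strip()
    if not text:
        return None
    rating = raw.get("rating")
    try:
        rating = float(rating) if rating is not None else None
    except (TypeError, ValueError):
        rating = None  # unparseable ratings are kept as "unknown"
    author = (raw.get("author") or "anonymous").strip()
    return Review(raw.get("url", ""), author, rating, text)


def forward(raw_records: Iterable[dict], sink: Callable[[dict], None]) -> int:
    """Normalise each record and hand it to the pipeline; return how many were sent."""
    sent = 0
    for raw in raw_records:
        review = normalise(raw)
        if review is None:
            continue  # skip empty records rather than polluting the pipeline
        sink(asdict(review))
        sent += 1
    return sent
```

In this sketch `sink` can be anything that accepts a dictionary, for example a function that posts to an internal API or appends to a list during testing.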

The system was built using Laravel, VueJS, Python, Selenium and Electron. It was hosted on Kubernetes infrastructure, with Redis used as a cache layer.
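As a rough illustration of the scalable worker idea, the Python sketch below shares a queue of page URLs between a configurable number of workers and records a per-page status of the kind a dashboard could display. It is a simplified stand-in, not the production design: threads and an in-process queue take the place of the Selenium workers running on Kubernetes, and `run_scrape` and `scrape_page` are hypothetical names.

```python
import queue
import threading
from typing import Callable, Dict, List, Tuple


def run_scrape(
    urls: List[str],
    scrape_page: Callable[[str], List[str]],
    workers: int = 4,
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Scrape every URL using `workers` concurrent workers.

    Returns (status, results): status maps each URL to "done" or "failed",
    results maps each successfully scraped URL to the items it yielded.
    """
    jobs: "queue.Queue[str]" = queue.Queue()
    for url in urls:
        jobs.put(url)

    status = {url: "queued" for url in urls}
    results: Dict[str, List[str]] = {}
    lock = threading.Lock()  # guards status and results

    def worker() -> None:
        while True:
            try:
                url = jobs.get_nowait()
            except queue.Empty:
                return  # no work left, so this worker exits
            with lock:
                status[url] = "running"
            try:
                items = scrape_page(url)  # e.g. a function driving a browser
            except Exception:
                with lock:
                    status[url] = "failed"
            else:
                with lock:
                    results[url] = items
                    status[url] = "done"

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status, results
```

Raising `workers` is the only change needed to scrape a larger site faster in this sketch, which mirrors the way the team could scale the number of workers per website.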

Like what you see? Fancy a chat?


Testimonials

When I wanted to launch my digital newsletter I needed a new website to do it justice. I approached Dean because he has always looked after my other websites in an efficient and timely manner. After a detailed brief he put together a basic design, which completely captured the feel of the site, and it went from there. With further consultation the site grew and the vision became a reality and I am now proud to be able to launch my newsletter off the back of a stylish and well-designed website. Dean also offers ongoing support which is vital to a business such as mine.

Sarah-Jane Prew - Cabin Safety Update