Highly Scalable Bespoke Website Scraping System

As part of one of my contracts, I was tasked with developing a bespoke website scraping system that let the internal team extract reviews and attributes from websites. The system was designed to be highly scalable: the team could scale up the number of Selenium workers assigned to a given website as needed. It also included a dashboard where the team could view the status of each scrape and monitor overall project status.
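To illustrate the scaling idea, here is a minimal sketch of a worker pool with per-job status tracking, written in Python using only the standard library. It is not the production code: the names `ScrapeJob`, `run_scrape` and `fetch_reviews` are hypothetical, and `fetch_reviews` stands in for whatever a real Selenium worker would do for a single page.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class ScrapeJob:
    """One page to scrape, plus the status a dashboard could display."""
    url: str
    status: str = "queued"  # queued -> running -> done / failed
    reviews: List[str] = field(default_factory=list)


def run_scrape(
    urls: Iterable[str],
    fetch_reviews: Callable[[str], List[str]],  # hypothetical stand-in for a Selenium worker
    worker_count: int,
) -> Dict[str, ScrapeJob]:
    """Scrape every URL using `worker_count` parallel workers."""
    jobs = {url: ScrapeJob(url) for url in urls}  # duplicates collapse to one job
    pending: "queue.Queue[str]" = queue.Queue()
    for url in jobs:
        pending.put(url)

    def worker() -> None:
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # no work left, so this worker exits
            job = jobs[url]
            job.status = "running"
            try:
                job.reviews = fetch_reviews(url)
                job.status = "done"
            except Exception:
                # A failed page is recorded rather than stopping the whole run.
                job.status = "failed"

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return jobs
```

Raising `worker_count` is the only change needed to put more workers on a large site, and the `status` field on each job is the kind of information a status dashboard could read.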

In production, it scraped over 100k comments within a few hours, and allowed the team to scale up as needed to scrape large websites quickly and efficiently. When faced with bot detection technologies, the system overcame these protections by combining Selenium and PhantomJS with IP switching.

The system was able to overcome the following website protections:

- CAPTCHA
- Captcha2K (XEvil)
- reCAPTCHA v1 and v2
- Multilayer CAPTCHA

A desktop application was also developed using Electron. For websites with significant bot protection, it let human operators extract the reviews with a click of the mouse. The application could also work as an extension of the web browser (Chrome). It allowed users to extract reviews, manage their projects and monitor project status.

The team needed a way of managing the data once it was extracted, so I developed middleware that passed all extracted data into their internal pipeline.
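As a rough illustration of what such middleware might look like, the sketch below normalises raw scraped records into one shape and forwards them to a pipeline callback. The field names (`author`, `body`, `rating`), the function names and the `send` callback are all assumptions made for the example; the real pipeline interface is not described here.

```python
from typing import Callable, Dict, Iterable, Optional


def parse_rating(text: str) -> Optional[float]:
    """Return the rating as a number, or None if it is missing or unparseable."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def normalise_review(raw: Dict[str, str], source: str) -> Dict[str, object]:
    """Map one raw scraped record onto a consistent shape (assumed field names)."""
    return {
        "source": source,
        "author": raw.get("author", "").strip() or None,
        "body": " ".join(raw.get("body", "").split()),  # collapse stray whitespace
        "rating": parse_rating(raw.get("rating", "")),
    }


def forward_reviews(
    raw_reviews: Iterable[Dict[str, str]],
    source: str,
    send: Callable[[Dict[str, object]], None],  # hypothetical pipeline entry point
) -> int:
    """Normalise each record, skip empty ones, and pass the rest to `send`."""
    forwarded = 0
    for raw in raw_reviews:
        record = normalise_review(raw, source)
        if not record["body"]:
            continue  # nothing worth passing on
        send(record)
        forwarded += 1
    return forwarded
```

Keeping `send` as a plain callable means the same middleware could write to a queue, an HTTP endpoint or a database without changing the normalisation step.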

The system was built using Laravel, VueJS, Python, Selenium and Electron. It was hosted on Kubernetes infrastructure, with Redis used as a cache layer.


Testimonials

As a design company, we have always had our own ideas of how something should look and the style we wanted to use. Dean was great at creating the websites and 3D modelling for our products.

He was always readily available for the updates and small changes that are constantly needed throughout the development of a site. The work has been top quality from the start, and as projects have moved on the quality has actually improved to levels we hadn’t thought possible.

As we had limited knowledge of the more technical side of website design, such as SEO, Dean was always happy to take the lead and help set up the parts of the website we hadn’t originally thought of as directly important.

We have worked on many sites with Dean and continually try to keep them up to date on both the design and technical sides. We look forward to working together on future projects.

Jake Sacree - Gecko Head Gear