Highly Scalable Bespoke Website Scraping System

As part of one of my contracts, I was tasked with developing a bespoke website scraping system that allowed the internal team to extract reviews and attributes from websites. The system was designed to be highly scalable: the team could scale up the number of Selenium workers needed to scrape a given website. It also included a dashboard where the team could view the status of each scrape and monitor overall project status.
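To illustrate the idea of scaling workers up or down per website, here is a minimal sketch, not the production code. It assumes a hypothetical `scrape_page` function standing in for a single Selenium worker, and a `worker_count` parameter that controls how many run at once; both names are illustrative only. The per-URL status it returns is the sort of information a dashboard could display.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_page(url):
    """Hypothetical stand-in for one Selenium worker visiting a page.

    A real worker would open the URL in a browser and extract reviews;
    here we return a placeholder record so the sketch runs anywhere.
    """
    return {"url": url, "reviews": []}


def run_scrape(urls, worker_count, scrape=scrape_page):
    """Scrape every URL using up to `worker_count` concurrent workers."""
    results = {}
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # Map each submitted job back to the URL it belongs to.
        futures = {pool.submit(scrape, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = {"status": "done", "data": future.result()}
            except Exception as exc:
                # Record the failure instead of aborting the whole run.
                results[url] = {"status": "failed", "error": str(exc)}
    return results
```

Raising `worker_count` for a large website lets more pages be processed in parallel, while a failed page is recorded rather than stopping the run.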

In production, it scraped over 100k comments within a few hours, and allowed the team to scale up as needed to scrape large websites quickly and efficiently. When faced with bot detection technologies, the system overcame these protections by using a combination of Selenium and PhantomJS to scrape the website, combined with IP switching.
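As a rough illustration of IP switching, the sketch below hands out proxies from a fixed pool in round-robin order. It is an assumption about how such rotation might look, not a description of the actual system; `ProxyRotator` and `assign_proxies` are hypothetical names.

```python
from itertools import cycle


class ProxyRotator:
    """Hands out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        """Return the next proxy, wrapping around at the end of the pool."""
        return next(self._pool)


def assign_proxies(urls, rotator):
    """Pair each URL with the proxy its request would be routed through."""
    return [(url, rotator.next_proxy()) for url in urls]
```

Each worker would then configure its browser session with the proxy it was assigned before loading the page.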

The system was able to overcome the following website protections:

- CAPTCHA
- Captcha2K (XEvil)
- ReCaptcha v1 and v2
- Multilayer Captcha

I also developed a desktop application using Electron. For websites with significant bot protection, it let people extract the reviews manually with a click of the mouse. The application could also run as an extension of the Chrome browser. It allowed users to extract reviews, manage their projects and monitor the status of each project.

The team needed a way of managing the data once it was extracted, so I developed middleware that passed all extracted data into their internal pipeline.
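The role of such middleware can be sketched as follows. This is only an illustration: the record schema, the `normalise_review` and `forward_reviews` names, and the `send` callable (standing in for whatever the internal pipeline accepts) are all assumptions, not the actual implementation.

```python
import json
from datetime import datetime, timezone


def normalise_review(raw, source):
    """Map a raw scraped record onto a hypothetical pipeline schema."""
    rating = raw.get("rating")
    return {
        "source": source,
        "author": (raw.get("author") or "").strip() or None,
        "rating": float(rating) if rating not in (None, "") else None,
        "text": (raw.get("text") or "").strip(),
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }


def forward_reviews(raw_reviews, source, send):
    """Normalise each record and pass it to `send`; return how many were sent."""
    sent = 0
    for raw in raw_reviews:
        record = normalise_review(raw, source)
        if not record["text"]:
            continue  # skip empty reviews rather than forward them
        send(json.dumps(record))
        sent += 1
    return sent
```

In a test, `send` can simply be `list.append`; in a real deployment it would be replaced by whatever client writes to the pipeline.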

The system was built using Laravel, VueJS, Python, Selenium and Electron. It was hosted on Kubernetes infrastructure, with Redis used as a cache layer.

Like what you see? Fancy a chat?

Email me Phone me

Testimonials

I have had the pleasure of working with Dean for several years and have found him to be invaluable to our business. Available at all hours, swift, prompt with his actions and incredibly knowledgeable, Dean makes my life so much simpler. He is brilliant at listening to some of my more fanciful ideas and gently leads me in the correct direction to ensure our website reaches its maximum potential. A genuine asset to our business and I look forward to working with him for many years to come.

Nick Compton - Elite West Holidays