Highly Scalable Bespoke Website Scraping System

As part of one of my contracts, I was tasked with developing a bespoke website scraping system that let the internal team extract reviews and attributes from websites. The system was designed to be highly scalable: the team could scale up the number of Selenium workers needed to scrape a given website. It also included a dashboard where the team could view the status of each scrape and monitor overall project status.
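
The scaling idea can be sketched as a pool of workers pulling page URLs from a shared list while a status table records progress for a dashboard. This is a minimal illustration under stated assumptions, not the production system: `scrape_page`, `run_scrape` and the status labels are hypothetical names, and the Selenium call is replaced with a stub so the sketch runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_page(url):
    """Hypothetical stand-in for a Selenium worker; returns fake reviews."""
    if "broken" in url:
        raise RuntimeError(f"could not load {url}")
    return [{"url": url, "review": f"review from {url}"}]


def run_scrape(urls, worker_count=4, scrape=scrape_page):
    """Scrape urls with worker_count parallel workers.

    Returns (reviews, status). Once the run finishes, status maps each
    URL to "done" or "failed", which a dashboard could display.
    """
    status = {url: "queued" for url in urls}
    reviews = []
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        futures = {pool.submit(scrape, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                reviews.extend(future.result())
                status[url] = "done"
            except Exception:
                # A failed page is recorded rather than stopping the run.
                status[url] = "failed"
    return reviews, status
```

Raising `worker_count` is the knob that corresponds to scaling up workers for a larger site; in a real deployment each worker would more likely be a separate container than a thread.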

In production, it scraped over 100k comments within a few hours, and allowed the team to scale up as needed to scrape large websites quickly and efficiently. When faced with bot detection technologies, the system overcame these protections by driving the website with a combination of Selenium and PhantomJS, together with IP switching.

The system was able to overcome the following website protections:

- CAPTCHA

- Captcha2K (XEvil)

- reCAPTCHA v1 and v2

- Multilayer CAPTCHA
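
The IP switching mentioned earlier can be sketched as a simple rotation: when a request is blocked, retry it through the next address in a list. This is only an illustration of the general technique. The names `fetch_with_rotation` and `BlockedError` are hypothetical, and the `fetch` function is supplied by the caller, so nothing here reflects how the real system detected a block or sourced its addresses.

```python
from itertools import cycle


class BlockedError(Exception):
    """Raised by the fetch function when the site rejects the request."""


def fetch_with_rotation(url, proxies, fetch, max_attempts=None):
    """Try url through each proxy in turn until one is not blocked.

    fetch(url, proxy) is a caller-supplied function (hypothetical here)
    that returns page content or raises BlockedError.
    Returns a (content, proxy_used) tuple.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    attempts = max_attempts or len(proxies)
    pool = cycle(proxies)
    for _ in range(attempts):
        proxy = next(pool)
        try:
            return fetch(url, proxy), proxy
        except BlockedError:
            continue  # switch to the next IP and retry
    raise BlockedError(f"all {attempts} attempts were blocked for {url}")
```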

A desktop application was also developed using Electron. For websites with significant bot protection, it let people extract the reviews manually with a click of the mouse. The desktop application could also work as an extension of the web browser (in Chrome). It allowed users to extract reviews, manage their projects and monitor the status of their projects.

The team needed a way of managing the data once it was extracted, so I developed middleware that passed all extracted data into their internal pipeline.
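
A middleware step of this kind can be sketched as a function that normalises each scraped record and hands it on to the pipeline. The field names, the `normalise_review` and `forward_reviews` helpers, and the `send` callable are all assumptions made for illustration; the real schema and pipeline interface are not described here.

```python
def normalise_review(raw):
    """Map a raw scraped record onto a fixed set of fields (assumed schema)."""
    rating = raw.get("rating")
    return {
        "source_url": raw.get("url", ""),
        "author": (raw.get("author") or "anonymous").strip(),
        "text": (raw.get("review") or "").strip(),
        "rating": float(rating) if rating not in (None, "") else None,
    }


def forward_reviews(raw_reviews, send):
    """Normalise each record, skip empty ones, and pass the rest to send.

    send is a hypothetical callable standing in for the internal pipeline.
    Returns the number of records forwarded.
    """
    forwarded = 0
    for raw in raw_reviews:
        record = normalise_review(raw)
        if not record["text"]:
            continue  # nothing useful was extracted for this item
        send(record)
        forwarded += 1
    return forwarded
```

Passing the pipeline in as a callable keeps the scraper unaware of where the data ends up, so the same middleware could feed a queue, a database or an HTTP endpoint.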

The system was built using Laravel, VueJS, Python, Selenium and Electron. It was hosted on Kubernetes infrastructure, with Redis used as a cache layer.

Like what you see? Fancy a chat?

Email me Phone me

Testimonials

Dean recently created an excellent animated logo for my website, Bude & Beyond (www.budeandbeyond.co.uk). What I never realised at the outset was how much time, effort, discussion, creativity and dedication goes into producing a logo. After an initial discussion, Dean created many versions, in different colour schemes and formats, which he also animated. He sought appropriate feedback and went the extra mile to ensure that the colours and design were absolutely spot on. I was very impressed at his determination to get it right and to re-create and adapt until he and I were both very happy with the final result.

So, I’d say Dean’s major strength is communication. I’m not IT-focused, but I appreciate a good-looking website/logo, so it really helps that he is able to discuss ideas in a friendly and accessible way, without over-use of jargon. He explains things well, and will interpret ideas in a client-focused way. He is extremely responsive, so wait times are very short. He listens to his client and creates ideas based on what he hears/understands. He checks all along the way (wording, colours, images, positioning) and will ask: what if we try this? He is also very open to constructive comments. Therefore, he provides plenty of guided choice.

I’m rather proud to now have a Dean Wronowski animated logo already in use on my social media and ready for my website (currently undergoing a major revamp). Dean’s designs are his trademark. Fresh, colourful and lively, they always try to encompass the nature of the product he is working with.

He is a remarkable talent whom Bude is fortunate to have.

Dawn Robinson - Bude and Beyond