Web mining is the application of data mining techniques to discover patterns from the World Wide Web. It uses automated methods to extract both structured and unstructured data from web pages, server logs and link structures. Our AI-based web mining technology combines web protocol analysis, data scraping, artificial intelligence and data processing. On top of this technology, we have built a distributed system to work with large volumes of information on the public web.
However, as the volume of data on the internet grows explosively, the bar for collecting large volumes of public data is also much higher than ever before. Thanks to our AI-based web mining technology, we are able to collect large volumes of public data without triggering anti-bot detection.
Non-professionals can easily run into all kinds of roadblocks, such as IP blocking, Google CAPTCHAs, hCaptcha challenges, etc. Most of the time, these roadblocks are quite difficult to deal with. A common trap for non-professionals is that writing some Python scraping code appears easy, but later they find that the code stops working very quickly, or that it doesn't work for some websites at all. This is because the anti-bot mechanism can't easily be seen in the UI of the web page. Modern websites can detect a bot and block your scraper in a great many ways. In many cases, those blocking mechanisms are also far more complicated than they appear in the front-end UI. Just take a look at this Cloudflare five-second check example:
Cloudflare, Inc. is an American web infrastructure and website security company that provides content delivery network and DDoS mitigation services. Its services occur between a website's visitor and the Cloudflare customer's hosting provider, acting as a reverse proxy for websites. Its headquarters are in San Francisco.
The example below is only a reference to give you a basic idea of how Cloudflare works and what mechanisms it uses to block robots. Once you have a basic understanding, you will realize how tough these roadblocks are to deal with.
In this example we use the Cloudflare five-second check as an illustration. When you browse some sites, you might have seen this:
If you think this is just a simple browser check, you are probably wrong. Let's take a deeper look at what is behind this five-second check. If you use Chrome developer tools to intercept the traffic, you will see the website sending data to Cloudflare during the check. When the five-second check finishes, the Network tab in Chrome developer tools shows a token returned by Cloudflare.
Let's dig further into that token request. We need to pull all the JavaScript from the site, and after a lot of debugging and searching we arrive at a piece of obfuscated JS code that encrypts a bunch of variables, such as your browser version, your mouse movements, your operating system version, your browser's cookie parameters, your device fingerprint, etc.
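From a scraper's point of view, the first visible symptom is usually that the response is a challenge page rather than the content you asked for. The sketch below shows one rough way to flag such responses. The `Response` class, the status codes and the text markers are all assumptions chosen for illustration; real challenge pages vary and change over time, so treat this as a heuristic and not a reliable detector.

```python
from dataclasses import dataclass

# Assumed heuristics for illustration only; not an official or complete list.
CHALLENGE_STATUS_CODES = {403, 503}
CHALLENGE_MARKERS = ("just a moment", "checking your browser", "captcha")


@dataclass
class Response:
    """Minimal stand-in for an HTTP response (hypothetical type)."""
    status: int
    body: str = ""


def looks_blocked(resp: Response) -> bool:
    """Return True if the response resembles a block or challenge page."""
    if resp.status == 429:
        # 429 Too Many Requests: the server is rate limiting this client
        return True
    text = resp.body.lower()
    has_marker = any(marker in text for marker in CHALLENGE_MARKERS)
    # Require both an error-like status and a challenge marker to cut false positives
    return resp.status in CHALLENGE_STATUS_CODES and has_marker
```

A scraper could call `looks_blocked` on every response and stop or slow down when it returns `True`, instead of silently saving challenge pages as if they were data.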
The encryption process is quite messy, see the screenshot below:
The core JS algorithm also uses highly obfuscated variable names:
If you dig into the JS code, you will find many intermediate variables with strange names, for example:
Well, we will not go any further here, since the purpose is not to show exactly how everything works, but to illustrate that dealing with Cloudflare's bot detection isn't as easy as it appears from the UI. Finding out exactly how it works could be a big black hole.
So here comes the question: will a regular bot be able to bypass this? Unfortunately, the answer is no. A regular bot is easily blocked by these roadblocks. A bot written by someone who is not a professional bot engineer, in particular, has no intelligence to get past them. When building our AI-based web mining system, we spent significant effort tweaking and testing it so that it can bypass most obstacles of this kind during the data gathering process.
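To make the idea of a token built from client signals more concrete, here is a toy sketch in Python. It is not Cloudflare's algorithm, which is not documented here. The function name `build_token`, the signal names, the sample values and the secret are all hypothetical, and a plain hash plus Base64 encoding stands in for the real encryption. It only shows why such a token is hard to forge: every signal feeds into the final string, so changing any one of them changes the result.

```python
import base64
import hashlib
import json


def build_token(signals: dict, secret: str = "demo-secret") -> str:
    """Bundle client signals into one opaque string (toy scheme, hypothetical)."""
    # Serialize with sorted keys so the same signals always give the same bytes
    payload = json.dumps(signals, sort_keys=True).encode("utf-8")
    # Mix in a secret and hash, so an edited payload no longer matches its digest
    digest = hashlib.sha256(secret.encode("utf-8") + payload).hexdigest()
    # Base64 hides the payload from a casual glance; it is encoding, not encryption
    encoded = base64.urlsafe_b64encode(payload).decode("ascii")
    return encoded + "." + digest


# Made-up sample values for the kinds of signals mentioned in the walkthrough
signals = {
    "browser_version": "example-browser/1.0",
    "os_version": "example-os 10",
    "mouse_moves": [[10, 12], [14, 19], [25, 31]],
    "cookies_enabled": True,
}
token = build_token(signals)
```

A script with no mouse movement, or with a browser version that does not match its other signals, would produce a token the server can tell apart from a real visitor's.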
Industry's Lowest Priced Google SERP API Service, Scrape Google SERP Anonymously and Consistently
Web Scrape Google Flights Data to Get Real Time Airline Ticket Prices and Flight Schedules
Web Crawler to Extract Product and Category Data from Top Fashion Website Nordstrom.com
Web crawlers to harvest food delivery data from Uber Eats, DoorDash, Grubhub ...
Web Crawlers to scrape homedepot.com for product listings and product details data
Web Crawlers to scrape Facebook data such as Facebook Events Data etc.
Scrape real-time hotel data from Cosmopolitan Las Vegas hotels
Web crawlers to scrape China hotel data from top hotel websites such as Holiday Inn, Ctrip etc.
Grab Holdings Inc., commonly known as Grab, is a Southeast Asian technology company headquartered in Singapore and Indonesia. In addition to transportation, the company offers food delivery and digital payments services via a mobile app. Grab currently operates in Singapore, Malaysia, Cambodia, Indo
Collect millions of real estate records from major Thai real estate website ddproperty.com
Web crawlers to scrape Lazada for product listings data and category data
Web Scraping product and category data from Fashionphile.com
Web Crawlers to Scrape Millions of products from Lowes.com
Web Crawlers to Scrape Global Interest Rate, Mortgage Rate, Deposit Rate
Web Crawlers to Scrape Millions of US Housing Properties from Zillow.com
Web Crawlers to Extract Millions of Product Data from Ecommerce Giant Walmart.com
One of the industry's best Web Crawlers (Service) for China's Major Ecommerce Websites such as Tmall, JD, Kaola, PinDuoDuo etc.