I never thought I would write something like this, but you can build your software so that it runs too fast (?!?). The glitch revealed itself as a strange pattern of failed communication with sites. I had real trouble understanding this behavior, because the errors were spread seemingly at random across many millions of requests. It sat on my TODO list with low priority until I got an email from a Danish site owner, explaining that the Bot had once again trespassed and was not obeying the basic robots laws.
I ran a live debugging session against his site, and the Bot obeyed as intended. Now it was getting strange. Could it be a change management problem with the newly upgraded bot software? It was not. Then I looked for case-sensitivity bugs (sometimes you just need to test for everything). No, all clear. I looked at the neural network and the extended latent semantics scheme the bots were using; nothing to gain there either. After some time I had it narrowed down to two plausible causes: hardware and/or too much traffic.
You can always blame the hardware, and often it carries some of the guilt. But in this case it was not a hardware problem. The software that runs the Bots is built to adapt its launch pattern and request scheme to the response information it gets from the target server, so that it maximizes data flow without turning into a DDoS.
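To make the idea of adapting to the target server concrete, here is a minimal sketch of one way such a throttle could work. It is not the actual CentiverseBot code: the class name `AdaptiveThrottle`, every parameter value, and the rule itself (speed up slowly on healthy responses, back off sharply on errors or slow responses) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveThrottle:
    """Hypothetical per-site delay controller (illustration only)."""

    delay: float = 1.0            # current pause between requests, in seconds
    min_delay: float = 0.2        # never go faster than this
    max_delay: float = 60.0       # never wait longer than this
    speedup: float = 0.9          # factor applied after a healthy response
    backoff: float = 2.0          # factor applied after a stressed response
    slow_threshold: float = 2.0   # response time (s) treated as a sign of load

    def record(self, status: int, elapsed: float) -> float:
        """Update the delay from one response and return the new delay."""
        # Treat 429 and any 5xx status, or a slow answer, as server stress.
        stressed = status == 429 or status >= 500 or elapsed > self.slow_threshold
        if stressed:
            self.delay = min(self.delay * self.backoff, self.max_delay)
        else:
            self.delay = max(self.delay * self.speedup, self.min_delay)
        return self.delay
```

A crawler loop would call `record(status, elapsed)` after every response and sleep for the returned number of seconds before sending the next request to the same site. The asymmetry (small speed-ups, large back-offs) is one common way to keep a client from overwhelming a struggling server.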
It was time to investigate the sockets. And there it was: the sockets were not behaving as planned. After I fixed this problem, the rate of failed requests dropped by 90%. Damn, I should have done this investigation earlier. Now the bots were behaving much better under stress.
It has been an exciting weekend. The first cycle ended Sunday afternoon, and all my crawlers had performed well with zero exceptions recorded. The master node launched the CentiverseBots again and instructed the organizer nodes to use the exact same crawling scheme as before.
The only difference from the first cycle is the memory map structure a CentiverseBot now trawls with when revisiting a site. According to my calculations, this memory map should shorten a crawl cycle by 40% or more, thanks to the known structures and statistics gained from previous cycles.
In theory these maps will automatically reduce the pressure that the crawler nodes put on the core, since they are now able to filter between noise and silence. So what are these memory maps? A memory map is actually a neural network decision pattern that can analyze the actual content of a given page at runtime. This gives the CentiverseBots the authority to determine when content has changed enough that a re-index is required.
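The decision "has this page changed enough to re-index?" can be illustrated with something much simpler than a neural network. The sketch below swaps in a plain word-shingle comparison (Jaccard distance) as a stand-in; the function names and the 0.3 threshold are assumptions for illustration, not part of the memory maps described here.

```python
def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    if len(words) < size:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def change_ratio(old: str, new: str, size: int = 3) -> float:
    """Jaccard distance between two page versions: 0.0 same, 1.0 disjoint."""
    a, b = shingles(old, size), shingles(new, size)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def needs_reindex(old: str, new: str, threshold: float = 0.3) -> bool:
    """Hypothetical rule: re-index when the change ratio reaches the threshold."""
    return change_ratio(old, new) >= threshold
```

With a rule like this, a page whose text is unchanged since the previous cycle is skipped, while a page with mostly new text is queued for re-indexing. A learned model would replace `change_ratio` with something that also weighs which parts of the page changed.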
The project now operates with two types of bots to download and verify content.
I’ve chosen to split them into two separate subsystems based on statistical details collected while running stress tests. When mining the crawl statistics, different patterns quickly emerge. One of them told me that sites come and go faster than I could ever have imagined.
I also realized some new and exciting possibilities when splitting the two systems.
The crawling system would be cleaner and it would operate way faster than before.
I would have a fully encapsulated verification system that could work completely independently of the real crawlers and would only be launched when considered necessary.
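One way to picture the hand-off between the two subsystems is a small queue that the crawlers feed and the verification bots drain. This is only a sketch under assumptions: the class name `VerificationQueue`, the "consecutive failures" trigger, and the default of three failures are all hypothetical, not taken from the actual project.

```python
from collections import defaultdict, deque


class VerificationQueue:
    """Hypothetical hand-off point between crawlers and verification bots."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = defaultdict(int)   # consecutive failures per site
        self.pending = deque()             # sites waiting for verification

    def report(self, site: str, ok: bool) -> bool:
        """Called by a crawler after each request; True if the site was queued."""
        if ok:
            self.failures[site] = 0        # any success resets the counter
            return False
        self.failures[site] += 1
        if self.failures[site] == self.max_failures:
            self.pending.append(site)      # queue once when the limit is hit
            return True
        return False

    def next_site(self):
        """Called by a verification bot; returns None when nothing is pending."""
        return self.pending.popleft() if self.pending else None
```

The crawlers only ever call `report`, so they stay fast and simple, and the verification bots run only when `next_site` actually returns something.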