The new server room is now ready (it has been for quite some time) and the Centiverse is working satisfactorily in the new environment. My only concern is the heavy load on the switches and routers. Therefore I am in the process of adding two new IPs to the system, which should speed things up considerably on the collection interfaces (bots etc.).
Furthermore, I've added 3 repository nodes which act like an NLB. Since every data element knows where it should persist itself, this add-on really rocks. Until now I have only observed what it did to the crawlers: it increased their efficiency by 60%. I can only imagine what will happen when the system gets 3 IPs…
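One way a data element can "know" where it belongs is to derive the node from its own key. This is only a minimal sketch of that idea, not a description of how the Centiverse actually does it: the node names, the `node_for` function and the choice of SHA-1 are all assumptions made for illustration.

```python
import hashlib

# Hypothetical node names; the post only says that there are three repository nodes.
REPOSITORY_NODES = ["repo-1", "repo-2", "repo-3"]


def node_for(key, nodes=REPOSITORY_NODES):
    """Map a data element's key to one repository node, deterministically."""
    # A stable hash (unlike Python's built-in hash()) gives the same answer
    # on every machine and in every process.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

With a scheme like this, any crawler or indexer can compute `node_for(url)` locally and go straight to the right repository without asking a central coordinator, which is one plausible reason such a setup spreads load well.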
The crawling period is almost done and the indexers will soon take over the game. How the system will react with the three repositories when indexing starts will soon be revealed.
[15-oct-2008]
What can I say, this is fast now. The indexing time has been reduced by two thirds. Yesterday I could see that the throughput had increased from around 300 to 2000-2700 requests per second on a single language node. I simply needed to see this, so I opened the door to the server room, expecting a light show of flashing routers, switches etc. There was no light show. Well, maybe there was, but my eyes and brain could not register the gaps between the flashes; it was as if everything just had the "on" button enabled. This proves that my design works and that it can scale when necessary.
It has been a very busy month for me personally. First I got my second child, a little baby boy. Two children take a lot of time, actually everything you've got. Anyway, after I had landed safely I could focus on the engine project again.
I was running low on physical space, so I had no choice but to order building materials, then design and start building a new, expanded server room.
This is still in progress; actually I have just installed the electrical wiring, with some help from a pro for some parts. I am really looking forward to when it's done, with new air-con and racks.
It's quite funny, because I'm not sure which is the hardest part for me in a project like this: building up the infrastructure (all included) or designing and building search software. Well, all right, I'm not a construction worker, but it's still fun to do sometimes.
So what do you do in your spare time? Well, I am downloading the Internet. People always look suspicious when I say this (?!?). And when you are downloading HTML files and start to parse them, this is where your first problems start.
My first thought was to use Regular Expressions as the backbone of the parsing code, and I spent quite a while refining and testing them against a big dataset.
Everything behaved quite nicely, even under stress. The code was clean and I was absolutely sure I had a winner, until nodes that were running the indexing software suddenly stopped responding: they used all available memory, and all CPU cores went to maximum, never to return to normal usage!
The problem escalated like a tsunami when nodes running GRID algorithms suddenly lost parts of the data or went into strange deadlock patterns.
WTF is going on in there? I know the cores can and will use a lot of everything (that's how they are designed), but this is completely insane behavior, and to make matters worse it's random.
After some digging it was clear that the root cause of the problem was in the downloaded HTML data. As I said in the beginning, this is where the first problem emerged. The problem is relatively straightforward: you have no control over the structure a given HTML page has. Therefore you either have to think of everything, which is a nightmare just to think about, or use a general pattern, which is where Regular Expressions come to the rescue. That is what I first assumed, anyway! But it turned out to be a different story.
A browser is a forgiving piece of software which tries to process all the tags, scripts etc. it encounters. Even when a page is error-prone, it will still show something to the end user. One of the things that brought my Regular Expressions to their knees was the fact that a given tag doesn't have to be closed in order to work. I am not thinking of image tags etc.; no, I am talking about a simple anchor (href) tag without its closing tag. In my case it was a page with about 1500 links, none of them closed.
The problem is hidden deep within the Regex engine and how it works (in general; some engines are better than others). Think of my 1500 links again, and imagine what happens when the backtracking goes nuts. The behavior actually has a name: Catastrophic Backtracking. After this discovery I had no choice but to abandon general patterns, and that was the end of my Regex adventure!
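To get a feel for the effect, here is a small sketch using Python's backtracking `re` engine. The pattern is the textbook nested-quantifier example, not the pattern I used on the HTML pages, and the function name `time_failed_match` is made up for the illustration.

```python
import re
import time

# Illustrative pattern with a nested quantifier; NOT the pattern from the post.
NESTED = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return the seconds spent failing to match 'a' * n followed by 'b'."""
    text = "a" * n + "b"  # the trailing 'b' guarantees that the match must fail
    start = time.perf_counter()
    result = NESTED.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the engine tried every way to split the a's first
    return elapsed
```

Calling `time_failed_match(n)` for n = 14, 16, 18, 20 should show the time growing roughly exponentially with n, because the engine tries every way of dividing the run of a's between the inner and outer `+` before it gives up. Keep n small; somewhere past 30 the call may effectively never return, which looks a lot like a node that stops responding.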
The solution has been to write my own parser, which mimics a simple forgiving browser. If I had just done this in the first place, it would have saved me a lot of time, money and frustration.
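My own parser is not shown here, but the idea of a forgiving, single-pass link extractor can be sketched with Python's standard `html.parser` module as a stand-in. The names `LinkCollector` and `extract_links` are hypothetical; the point is only that the parser reacts to each opening tag as it streams past and never goes looking for a matching `</a>`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags, closed or not."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; a missing </a> is simply never noticed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in the given HTML text, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

For example, `extract_links('<a href="/one">first <a href="/two">second')` returns `['/one', '/two']` even though neither anchor is closed. Because the input is consumed in one forward pass with no backtracking, a page with 1500 unclosed links costs about the same as a well-formed one.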