The heat is on.

It has been a very busy month for me personally. First, my second child arrived, a little baby boy. Two children take a lot of time, actually everything you’ve got. Anyway, I have landed safely and can focus on the engine project again.

I was running low on physical space, so I had no choice but to order building materials, then design and start building a new, expanded server room.

This is still in progress; I have just installed the electrical wiring, with help from a professional for some parts. I am really looking forward to it being finished, with new air conditioning and racks.

It’s quite funny, because I’m not sure which is the hardest part of a project like this: building the infrastructure (everything included) or designing and building the search software. All right, I’m not a construction worker, but it’s still fun to do sometimes.

August 2, 2008 00:56 by Claus

The ghost in the machine

So what do you do in your spare time? Well, I am downloading the Internet. People always look suspicious when I say this (?!?). And when you download HTML files and start to parse them, that is where your first problems start.

My first thought was to use regular expressions as the backbone of the parsing code, and I spent quite a while refining and testing them against a big dataset. Everything behaved nicely even under stress, the code was clean, and I was absolutely sure I had a winner, until nodes running the indexing software suddenly stopped responding: they used all available memory, and all CPU cores went to maximum, never to return to normal usage!

The problem escalated like a tsunami when nodes running GRID algorithms suddenly lost parts of the data or went into strange deadlock patterns. WTF is going on in there? I know the cores can and will use a lot of everything (that’s how they are designed), but this was completely insane behavior, and it was random, just to make matters worse.

After some digging it was clear that the root cause of the problem was in the downloaded HTML data. As I said at the beginning, this is where the first problem emerged. The problem is relatively straightforward: you have no control over the structure of a given HTML page, so you either have to think of everything, which is a nightmare just to contemplate, or use a general pattern, which is where regular expressions come to the rescue. That is what I first assumed, anyway! But it turned out to be a different story.

A browser is a forgiving piece of software which tries to process all the tags, scripts etc. it encounters. Even when a page is error-prone, it will still show something to the end user. One of the things that brought my regular expressions to their knees was the fact that a given tag doesn’t have to be closed in order to work. I am not thinking about the image tag etc.; no, I am talking about a simple anchor (href) tag without its closing tag. In my case it was about 1500 links on a page, none of them closed.

The problem is hidden deep within the regex engine and how it works (in general; some are better than others). Think of my 1500 links again, and imagine what happens when the backtracking goes nuts. The behavior actually has a name: catastrophic backtracking. After this discovery I had no choice but to abandon general patterns, and that was the end of my regex adventure! The solution has been to write my own parser, which mimics a simple, forgiving browser. If I had just done this in the first place, it would have saved me a lot of time, money and frustration.
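
To give an idea of what the forgiving approach looks like, here is a minimal sketch in Python. It is not my actual parser: it leans on the tolerant `html.parser` module from Python's standard library instead of a hand-written one, and the names `LinkCollector` and `extract_links`, as well as the test page, are made up for illustration. The point is only that the input is read in one forward pass and an opening tag is handled the moment it is seen, so a missing closing tag gives the code nothing to backtrack over.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, whether or not they are closed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; a missing </a> never matters here.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every href found in the page, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical test page: 1500 anchors, none of them closed.
page = "<html><body>" + "".join(
    f'<a href="/page/{i}">link {i}' for i in range(1500)
) + "</body></html>"

links = extract_links(page)
print(len(links))  # prints 1500
```

A real crawler would of course need more than this (relative URLs, encodings, scripts and so on), but the one-pass structure is what keeps the cost predictable on broken pages.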

June 4, 2008 22:39 by Claus