The ghost in the machine

So what do you do in your spare time? Well, I’ am downloading the Internet - People always look suspicious when I say this (?!?) - and when you are downloading HTML files and you start to parse them this is where your first problems starts.

My first thought was to use Regular Expressions as the backbone in the parsing code and I used quite a while refining and testing them with a big dataset. Everything behaved quite nicely even under stress, the code was clean and I was absolutely sure that I got a winner until nodes that was running indexing software suddenly stopped responding, used all availably memory and all CPU cores were at maximum never to return to normal usages!

The problem was escalating like a tsunami when nodes running GRID algorithms suddenly lost parts of the data or was going into strange deadlocks patterns. WTF is going on in there? I know the cores can and will use a lot of everything – that’s how they are designed - but this is a completely insane behavior and it’s random just to make matters worse.

After some digging it was clear that root curse of the problem was in the html data downloaded. As I told in the beginning, this is where the first problem emerged. The problem is relative straightforward; you have no control of structure a given html page has, therefore you has to think of everything – which might be a night mare just to think of - or use a general patter which is where Regular Expressions comes to the rescues. This is what I first assumed anyway! But it turned out to be a different story.

A browser is forgiven piece of software which try to compute all tags and scripts etc it encounter. When a page is error phone it will after all show something to the end user. One of the things that brought my Regular Expressions to it knees, was the fact a given tag dosent have to be closed in order to work. I am not thinking about image tag etc. no; I am talking about a simple href tag without its closing tag – in my case it was about 1500 links on a page with no end.

The problem is hidden deep within the Regex engine and how it works (in general, some are better than others), think of my 1500 links again, and imagine what will happen when the backtracking goes nuts. Actually the behavior has a word; Catastrophic Backtracking and after this discovery, I had no choice to abandon general patterns and that was the end of my Regex adventure! The solution has been to write my own parser which would mimic a simple forgiven browser. If I had just done this in the first place, it would have saved me a lot of time, money and frustration.

June 5, 2008 00:39 by Claus
E-mail | Permalink | Comments (0) | Comment RSSRSS comment feed

128 Bit numbers

One can dream of one Yottabyte of memory which is about 1,208,925,819,614,629,174,706,176 bytes or 2^80. I am not sure what the good guys at Mountain View are juggling with but it must be around 2 or 3 PB.

I once again found a strange name, I call them HyperLongs. They operate like a long (64 bit) but they can handle twice the amount of data – well, daaa! - At the moment I can do the four basic calculation operations like additions, subtractions, multiplications and divisions.

The reason for this move was relative simple; I considered it necessary to simply represent a given website/page and sometime even text phases as a “natural” number – 64bit was to short - and quickly determent various bits and pieces based around it. This works extremely well since the semantics statistics shows that almost 30% of the repository is near document duplicate or worse, exact duplicates.

The system can relative quickly run the algorithms over the entire document collection and locate what I consider as “waste of core processing time and storage capacity”. It’s a walk on a knife-edge, I know that, and the approach is (O(d log d)) time in a worst case scenario where all data would be equal. At processing time it’s O(d) time which is okay in the long run considering core processing time.

For more information about I-Match algorithm’s and idf ranges I can recommend books found here

May 8, 2008 01:10 by Claus
E-mail | Permalink | Comments (0) | Comment RSSRSS comment feed