Search Engine Spam

This is probably the most important sub-project within the project right now. It is consuming a lot of my time, which I don’t have! Anyway, I can’t claim that the engines will be bulletproof against search engine spam. Nevertheless, they will come close, thanks to semantics.

One thing is certain: spam will never stop. It will evolve, and there will always be some smart people out there looking for new ways. That is just an incentive to keep refining the algorithms that deal with it. I have often asked myself this question: why do people spam? I’ve come to this rather radical conclusion: more than 50% of the people in the world must have a below-average IQ, and if you can get $1 from only the dumbest 0.001% of the billion people on the Internet, well, that's a good salary.

The spamdexing algorithm(s) have been designed to combat many of the known de facto techniques used by spammers, as well as some new ones, and they are working remarkably well. But as I mentioned earlier, the spammers always seem to be one step ahead. And they are good, extremely good. I was forced to analyze many gigabytes of data looking for patterns etc. before I even wrote anything.

One of the main issues in the system has been combating keyword stuffing, since it disturbs the semantic algorithms, produces a lot of noise and increases indexing time. (…) On the other hand, the time used to process and hunt down spam is more or less equal to the time it would cost if the spam just passed through into the cores. One of my colleagues from my day job once said: unused CPU time is time you’ll never get back.
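For readers unfamiliar with keyword stuffing, here is a minimal sketch of one naive signal for it: the share of a document taken up by its single most frequent term. This is not the Centiverse algorithm, which remains a trade secret. The function names, the 0.3 threshold and the minimum token count are all assumptions chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text, min_len=3):
    """Lowercase the text and keep alphanumeric tokens of at least min_len characters."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= min_len]


def top_term_share(text):
    """Return the fraction of tokens accounted for by the most frequent token."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    _, count = Counter(tokens).most_common(1)[0]
    return count / len(tokens)


def looks_stuffed(text, threshold=0.3, min_tokens=10):
    """Flag text whose most frequent token dominates it (hypothetical threshold)."""
    if len(tokenize(text)) < min_tokens:
        return False  # too short to judge from term frequency alone
    return top_term_share(text) >= threshold


spam = "cheap watches " * 10 + "buy now"
normal = ("The new server room is ready and the crawlers are working "
          "well in the new environment today")
print(looks_stuffed(spam), looks_stuffed(normal))
```

A single frequency threshold like this is easy to evade and will misfire on short or repetitive legitimate pages, so it would only ever be one signal among several.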

The Centiverse spamdexing algorithm(s) will operate from different angles, with different characteristics and roles, to achieve the best result in distinguishing between spam and non-spam documents and sentences.

How do they work? That’s a trade secret.

Further reading can be found in this paper, if you are interested.

December 21, 2008 01:41 by Claus

More power, more data

The new server room is now ready, and has been for quite some time, and the Centiverse is working satisfactorily in the new environment. My only concern is the heavy load on the switches and routers. Therefore I am in the process of adding two new IPs to the system, which should considerably speed things up on the collection interfaces (bots etc.).

Furthermore, I've added 3 repository nodes which act like an NLB. Since every data element knows where it should persist itself, this add-on really rocks. Until now I have only observed what it did to the crawlers: it increased their efficiency by 60%. I can only imagine what will happen when the system gets 3 IPs… The crawling period is almost done and the indexers will soon take over the game. How the system will react to the three repositories when indexing starts will soon be revealed.
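One common way for a data element to "know" where it should persist is to derive the target node from its own key. The sketch below shows that idea with a stable hash over three nodes. It is only an assumption about how such placement could work, not a description of the Centiverse implementation; the node names and the function name are hypothetical.

```python
import hashlib

# Hypothetical names for the three repository nodes.
REPOSITORY_NODES = ("repo-1", "repo-2", "repo-3")


def node_for(key, nodes=REPOSITORY_NODES):
    """Map a data element's key to one repository node, deterministically."""
    # Use a stable hash so that every process picks the same node for the same key.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


for url in ("http://example.com/a", "http://example.com/b"):
    print(url, "->", node_for(url))
```

With plain modulo placement, adding or removing a node remaps most keys, which is why larger setups often use consistent hashing instead.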

[15-oct-2008]
What can I say, this is fast now. The indexing time has been reduced by 2/3. Yesterday I could see that the throughput had increased from around 300 to 2000-2700 requests per second on a single language node. I simply needed to see this, so I opened the door to the server room, expecting a light show of flashing routers, switches etc. There was no light show. Well, maybe there was, but my eyes and brain could not register the gaps between the flashes; it was as if everything just had the “on” button enabled. This just proves that my design works and that it can scale when necessary.

October 14, 2008 00:33 by Claus

The heat is on.

It has been a very busy month for me personally. First I got my second child, a little baby boy. Two children take a lot of time, actually all the time you have. Anyway, after that I landed safely and could focus on the engine project again.

I was running low on physical space, so I had no choice but to order building materials, then design and start building a new server room.

This is still in progress. I have just installed the electrical wiring, with some help from a pro for certain parts. I am really looking forward to when it’s done, with new air-con and racks.

It’s quite funny, because I’m not sure which is the hardest part for me in a project like this: building up the infrastructure (all included) or designing and building the search software. Well, all right, I’m not a construction worker, but it’s still fun to do sometimes.

August 2, 2008 02:56 by Claus