October 11, 2007

robots & data centers

two very important things:

1. Google is now providing academics with access to MapReduce. That includes access to a $30M research cluster.

2. The DARPA Urban Challenge is on in less than a month. Here are the latest updates & vids, and here is a list of competitors with photos.

ap.

October 07, 2007

Memewars

Meme-trackers are curious beasts. Since I first discovered memeorandum.com a couple years back, I have been keenly studying them and their full-text clustering algorithms.

Yesterday’s News, Today.

The basic meme-tracker premise is simple – they scour the blogosphere & newsosphere for the latest action so you don’t have to.

While they are all reasonably successful at it, I don’t think they have quite nailed the vision just yet, in the same way that AltaVista & InfoSeek didn't quite solve the web information retrieval problem. I say that because I keep finding myself drifting back to sites like Slashdot & Gizmag because they know stuff that the meme-trackers will never tell me.

The root cause is that most meme-trackers utilize inbound link weight for their ranking system in some way or another. Whether it is based on naïve 1st order link count, the recursive PageRank random surfer model or extracting bipartite authority graphs, they all share the same problem - emerging topics are not heavily linked to begin with and not all interesting topics attract sufficient links over time.

In short, too much reliance on inbound link weight can result in a lot of missed information with the remainder being delivered quite slowly.
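
To make that cold-start problem concrete, here is a minimal sketch in R. The four pages and their links are made up for illustration, and the pagerank() helper is a plain textbook power iteration, not any particular meme-tracker's ranking code. A story that has only just been published has no inbound links, so it sits at the bottom under both the naive count and PageRank.

```r
# Hypothetical link graph: links[i, j] = 1 means page i links to page j.
pages <- c("old_hub", "popular_post", "follow_up", "breaking_story")
links <- matrix(0, nrow = 4, ncol = 4, dimnames = list(pages, pages))
links["old_hub", "popular_post"] <- 1
links["popular_post", "old_hub"] <- 1
links["follow_up", "popular_post"] <- 1
links["follow_up", "old_hub"] <- 1
links["breaking_story", "popular_post"] <- 1  # links out, but nobody links in yet

# Naive first-order score: the number of inbound links per page.
inbound <- colSums(links)

# PageRank by power iteration (random surfer model, damping factor d).
pagerank <- function(links, d = 0.85, iterations = 100) {
  n <- nrow(links)
  out_degree <- rowSums(links)
  # Row-normalise so each page splits its vote across its outbound links;
  # a page with no outbound links is treated as jumping anywhere at random.
  transition <- links / ifelse(out_degree == 0, 1, out_degree)
  transition[out_degree == 0, ] <- 1 / n
  rank <- rep(1 / n, n)
  for (i in seq_len(iterations)) {
    rank <- (1 - d) / n + d * as.vector(rank %*% transition)
  }
  names(rank) <- rownames(links)
  rank
}

scores <- pagerank(links)
```

However interesting breaking_story turns out to be, neither inbound nor scores can tell until other pages start linking to it.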

BuzzTracker sold for $5M

Yahoo's recent purchase of buzztracker.com for $5M put a valuation on the meme-tracker landscape. Assuming that valuation is a function of eyeballs, then based on Quantcast’s data, Technorati could be worth $700M. Going by Alexa's data, Technorati could be worth as much as $1.3B.

So who is winning?

Here is a list of meme-trackers, along with their current Alexa Rank & Quantcast Reach data.

Of the top 3 (according to Alexa), Technorati & Feedster both started life as blog search engines and have only recently evolved into meme-trackers. Topix, on the other hand, will probably evolve itself off the list soon, as it is looking more and more like a social media site.

Continue reading "Memewars" »

July 16, 2007

Sorting a data.frame in R

I frequently find myself having to re-order rows of a data.frame based on the levels of an ordered factor in R.

For example, I want to take this data.frame:

	  product store sales
	1       a    s1    12
	2       b    s1    24
	3       a    s2    32
	4       c    s2    12
	5       a    s3     9
	6       b    s3     2
	7       c    s3    29

And sort it so that the rows from the stores with the highest total sales come first:

	  product store sales
	3       a    s2    32
	4       c    s2    12
	5       a    s3     9
	6       b    s3     2
	7       c    s3    29
	1       a    s1    12
	2       b    s1    24

I keep forgetting the exact semantics of how it's done and Google never offers any assistance on the topic, so here is a quick post to get it down once and for all, both for my own benefit and the greater good.
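
One way to do it, sketched here under the assumption that the goal is to rank stores by their total sales: compute the totals with tapply(), turn store into an ordered factor whose levels follow that ranking, and then index the data.frame with order(). The name sales_data is just a placeholder for the data.frame above.

```r
sales_data <- data.frame(
  product = c("a", "b", "a", "c", "a", "b", "c"),
  store   = c("s1", "s1", "s2", "s2", "s3", "s3", "s3"),
  sales   = c(12, 24, 32, 12, 9, 2, 29)
)

# Total sales per store: s1 = 36, s2 = 44, s3 = 40.
store_totals <- tapply(sales_data$sales, sales_data$store, sum)

# Store names ranked from highest total to lowest.
store_levels <- names(store_totals)[order(store_totals, decreasing = TRUE)]

# Make store an ordered factor whose levels follow that ranking.
sales_data$store <- factor(sales_data$store, levels = store_levels, ordered = TRUE)

# order() on a factor sorts by level position, and ties keep their original
# order, so rows within a store stay in the sequence they started in.
sorted <- sales_data[order(sales_data$store), ]
```

The part that is easy to forget is the trailing comma in sales_data[order(...), ]: order() returns row positions, and the empty slot after the comma means "keep every column".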

Continue reading "Sorting a data.frame in R" »

May 18, 2007

what's the most important web2 acquisition to date?

So 24/7 Media was just bought by WPP and the world has barely noticed.

"24/7 Real Media (Nasdaq:TFSM) is being bought by WPP Group Plc. (NYSE:WPP), marketing services company by revenue, for $649 million to beef up its presence in the fastest growing segment of the advertising market. WPP said on Thursday its GroupM agency expects online advertising to exceed $33 billion this year, or more than 8 percent of global ad spending, and is seen growing strongly in the future. 24/7 Real Media’s shares gained 3.8%." - 123jump.com.

Why is this so important? After all, 24/7 Media is the other other white meat of online ad networks after Google & DoubleClick, while WPP is the world's 2nd largest marketing conglomerate after Omnicom. WPP has its roots firmly planted in the glamorously traditional world of tv, radio & press, which on the surface makes this a bit of an odd fit.

Well, 24/7 do have some rather neat stuff in the engine room, technology that has evolved over a decade and a bit. Back in the last boom I had the good fortune of working with some of the guys who built some of their ad serving software - people who later went on to build Google's AdWords platform.

WPP bought this technology so they could have a unified technology platform to manage data across all their clients, globally.

This deal really marks the point in time when the marketing world has finally climbed to the top of the hill and announced to the world that the times are changing.

Welcome to Marketing 2.0.

April 04, 2007

Stock or Not

A good friend of mine, Josh Reich, has built a simple but compelling game where you are presented with a chart consisting of data from a real financial market alongside a chart of some random data - you are charged with the task of spotting the fakes.

Turns out it's harder than it seems: the average score is 50%, no better than a coin flip. On the surface that kinda says to me technical analysis is bogus, but here is the thing... people are either horrendously bad or exceptionally good at it.

I have been hassling Josh to include a survey to determine if there is a correlation between adeptness at spotting fake stock charts and being a successful trader (or having a Wall Street zip code).

Which leads me to the question..
Can u guess the difference between a real stockchart and a random pile of junk?

February 12, 2007

solving the uNPsolvable?

Some startup is launching a new chip tomorrow. It makes the huge claim of being able to solve an NP-complete problem. They chose a great venue - the Computer History Museum in Silicon Valley, just around the corner from the Googleplex.

Lots of computing problems are NP-complete, generally most of the good ones. The 25 word summary is that nobody knows how to solve NP-complete problems exactly in a reasonable amount of time, except when the problem is reduced to a trivial size. A classic example is finding the optimal route a FedEx truck should take (the travelling salesman problem). Fortunately for most of these problems, fast approximations do exist.
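
Here is a minimal sketch in R of why the FedEx example blows up. The stop coordinates are made up, and the helper names (permutations, route_length, brute_force, nearest_neighbour) are just for illustration. The exact approach has to try every ordering of the stops, which is 4! = 24 routes for five stops but 19! (roughly 1.2e17) for twenty. The nearest-neighbour shortcut is fast but carries no guarantee of finding the best route.

```r
# Hypothetical delivery stops as (x, y) coordinates; the depot is stop 1.
stops <- matrix(c(0, 0,
                  2, 0,
                  2, 2,
                  0, 2,
                  1, 3), ncol = 2, byrow = TRUE)
distances <- as.matrix(dist(stops))

# All orderings of a vector, built recursively.
permutations <- function(items) {
  if (length(items) <= 1) return(list(items))
  result <- list()
  for (i in seq_along(items)) {
    for (rest in permutations(items[-i])) {
      result[[length(result) + 1]] <- c(items[i], rest)
    }
  }
  result
}

# Length of a round trip that visits the stops in the given order.
route_length <- function(route) {
  legs <- cbind(route, c(route[-1], route[1]))  # each stop paired with the next, wrapping home
  sum(distances[legs])
}

# Exact answer: try every ordering of the non-depot stops.
brute_force <- function() {
  routes <- lapply(permutations(2:nrow(stops)), function(p) c(1, p))
  lengths <- sapply(routes, route_length)
  routes[[which.min(lengths)]]
}

# Fast estimate: always drive to the nearest stop not yet visited.
nearest_neighbour <- function() {
  route <- 1
  while (length(route) < nrow(stops)) {
    unvisited <- setdiff(seq_len(nrow(stops)), route)
    last <- route[length(route)]
    route <- c(route, unvisited[which.min(distances[last, unvisited])])
  }
  route
}

best <- brute_force()
estimate <- nearest_neighbour()
```

Comparing route_length(best) with route_length(estimate) shows how much, if anything, the shortcut gives away on a given set of stops.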

Where NP-completeness gets interesting is that if you can efficiently solve any single NP-complete problem (such as the FedEx problem), you can use that same solution to solve every other NP-complete problem - i.e., solve one and you have solved them all.

If they are for real then the implications are huge, from breaking encryption to achieving artificial intelligence. I am optimistically sceptical.. let's see what happens.

ap.

February 08, 2007

The Dreaded Heisenbug

Yay, I used a Bayes decision tree to isolate a bug today in a fraction of the time it would have otherwise taken.

About 48 hours ago I started work on repairing a Heisenbug. For the less geeky of you, Heisenbugs are a rather nasty class of software fault that “disappears or alters its characteristics when it is researched”.

Most bugs are generally the result of only a single input (or knob or button or whatever) being set to a single value (or range or whatever). Software testers live by this assumption, and 99% of the time it is true, so true that we tend to forget (or is that ignore?) the 1% of bugs that can’t be explained so easily.

After tearing my hair out all yesterday and going to bed feeling somewhat defeated, this morning I woke with a fresh mind, a new day and a small suspicion that perhaps this bug fell into that 1%.

After poking and prodding at my adversary for most of the morning it was pretty clear that this was occurring probabilistically and that some combinations of inputs made the bug occur more frequently.

Turns out the randomness was due to a threaded race condition, and a combination of three inputs being in a certain range tended to make it occur more frequently. Knowing those settings (which fell out of the decision tree) was thankfully enough to explain why the bug was occurring.
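
For the curious, here is a rough sketch in R of the underlying idea. It is not the decision tree itself but a simpler stand-in: tabulate the failure rate for every combination of inputs and look for the one that stands out. The test log is entirely made up, and the names input_a, input_b and input_c are hypothetical placeholders for the three real inputs.

```r
# Hypothetical test log: every combination of three inputs, 20 runs each.
runs <- expand.grid(run = 1:20,
                    input_a = c("low", "high"),
                    input_b = c("low", "high"),
                    input_c = c("low", "high"),
                    stringsAsFactors = FALSE)

# Made-up outcomes: the bug shows up rarely everywhere (1 run in 20), but
# often (12 runs in 20) when all three inputs are "high" at the same time.
risky <- runs$input_a == "high" & runs$input_b == "high" & runs$input_c == "high"
runs$failed <- ifelse(risky, runs$run <= 12, runs$run <= 1)

# Failure rate for each of the 8 input combinations.
rates <- aggregate(failed ~ input_a + input_b + input_c, data = runs, FUN = mean)

# The combination where the bug bites most often.
worst <- rates[which.max(rates$failed), ]
```

No single input predicts the failure on its own here. It only becomes obvious once the three are looked at together, which is the kind of interaction a decision tree is good at surfacing.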