Saturday, April 19, 2008

Social Net 2.0 is about the Edges not the Nodes

In the social graph model, people are usually the nodes and their relationships to one another are the edges. Profile data become node properties. Relationship data (worked with, married to, used to date) are the edge properties. There are some twists on this, like making a company or school (say Stanford) into a node that people connect to, rather than an aspect of a relationship ("used to go to Stanford with").
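
To make that concrete, here is a rough Python sketch of the two flavors (the class and property names are mine, not any site's actual API):

    # Minimal sketch of the node/edge model described above (all names hypothetical).

    class Node:
        """A person, or an institution like a company or school."""
        def __init__(self, kind, **profile):
            self.kind = kind          # e.g. "person", "school", "company"
            self.profile = profile    # node properties: name, hometown, etc.

    class Edge:
        """A relationship between two nodes, carrying its own properties."""
        def __init__(self, a, b, **relationship):
            self.a, self.b = a, b
            self.relationship = relationship  # edge properties: how, when, where

    # Relationship data as edge properties ...
    alice = Node("person", name="Alice")
    bob = Node("person", name="Bob")
    worked = Edge(alice, bob, kind="worked with", where="Acme", years="2001-2004")

    # ... versus the "twist": the institution itself becomes a node.
    stanford = Node("school", name="Stanford")
    attended = Edge(alice, stanford, kind="attended", years="1995-1999")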

Proposition 1

Today's graphs have lots of node data, but are very weak on edge data. It's the problem where

  1. If I come to your Facebook profile, it is hard or impossible to tell, just from your friends list, who you are actually friends with and why.
  2. You may reasonably hesitate to fess up to some legit social graph links because of this missing data. Like, say, an anti-copyright crusader who's really an old buddy from junior high, but whose link might jeopardize your Hollywood job chances if the relationship were misconstrued.

This binary friendship is the weakest form of social graph, especially as it gets syndicated via apps and APIs throughout the Internet. It's poor quality information.

Proposition 2

The next step, call it version 1.1 of edge data, appears in schemes like FOAF and apps like LinkedIn. LinkedIn at least asks how you know someone ... did you work with them? where? when? That information adds value to their network.

The uber-social-graph (the hypothetical union of graphs) that we imagine and talk about can only have real value when the edges have quality (how do you connect), quantity (how well), and sign/directionality (let's face it, there are some people I might like, know, or relate to more than they do me, even if we're generally on the same level). In fact, the edges are multidimensional ... multiple different linearly independent quality/quantity/sign groups might all apply over a single relationship. Like anytime you've spent time working professionally with a good friend. If it's someone you've been romantically involved with, toss another set of attributes on there as well.
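
A rough sketch of what a multidimensional edge could look like, again in hypothetical Python (the facet fields and numbers are just illustrative guesses):

    # Several independent "facets" can coexist on one relationship, each with
    # its own quality, quantity, and direction/sign.

    from dataclasses import dataclass, field

    @dataclass
    class Facet:
        how: str          # quality: how you connect ("worked with", "dated", ...)
        strength: float   # quantity: how well, 0.0 .. 1.0
        balance: float    # sign/directionality: >0 if I invest more, <0 if they do

    @dataclass
    class MultiEdge:
        a: str
        b: str
        facets: list = field(default_factory=list)

    # A good friend you also worked with, and once dated: three independent
    # quality/quantity/sign groups layered over a single relationship.
    e = MultiEdge("me", "them", facets=[
        Facet(how="friends since junior high", strength=0.9, balance=0.0),
        Facet(how="worked together at Acme",   strength=0.6, balance=0.1),
        Facet(how="dated briefly",             strength=0.3, balance=-0.2),
    ])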

Proposition 3

One way to get to 2.0 would be to ask people to create a profile for each relationship (friend) they add. That probably won't happen, because it's labor intensive in proportion to the value of the data: saying I worked with someone is not as good as saying where it was and when, adding that we were pretty friendly and had beers a lot after work, adding that we were both really interested in, say, programming languages or biking, adding that I convinced him to volunteer with me on so-and-so's political campaign.

It's a lot of work.

Passive acquisition of the data would be easier and more accurate: my emails or IMs with someone would tell how often I talked to them, what about, in what context, etc. The existing email, IM, and soc net comms providers (e.g. facebook messaging) have most of this data already. Ideally they would get my approval and once-over before asserting any conclusions, since it's still early days as far as accuracy of automated semantic analysis.
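
Something like this hypothetical Python sketch is what I have in mind; the message fields and topic tags are made up, and the important part is the approval step at the end:

    # Passive edge inference from a message log, with a user once-over before
    # anything is asserted on the graph.

    from collections import Counter

    def infer_edge(messages):
        """messages: list of dicts like {"with": "bob", "topic": "java", "when": ...}"""
        topics = Counter(m["topic"] for m in messages)
        return {
            "contact_count": len(messages),
            "top_topics": [t for t, _ in topics.most_common(3)],
        }

    def confirm_with_user(proposed):
        # Automated semantic analysis is still error-prone, so nothing is
        # written to the graph until the user approves or corrects it.
        print("We think:", proposed)
        return input("Assert this on your graph? [y/n/edit] ")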

Proposition 4

Even if passive acquisition and analysis were occurring and were accurate, the data could be quite wrong as long as my life spans multiple systems. E.g., gmail sees 50 emails with someone, all related to Java work. The Google graph system draws one conclusion. But if there are 2500 AIM messages somewhere else, about wacky topics and at different times of day or night to the same person, the picture of the relationship might look a lot different. So the data from many modalities of communication (chat, IM, email, TXT, phone calls) and many systems needs to be analyzed together.
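
A toy sketch of the merge, in hypothetical Python, shows how much the picture shifts once the second source is counted:

    # Merge per-system summaries (email, IM, TXT, calls) before
    # characterizing an edge, instead of trusting any one system.

    def merge_sources(summaries):
        """summaries: list of dicts like
           {"system": "gmail", "count": 50, "topics": {"java": 50}}"""
        total = sum(s["count"] for s in summaries)
        topics = {}
        for s in summaries:
            for topic, n in s["topics"].items():
                topics[topic] = topics.get(topic, 0) + n
        # Weight each topic by its share of *all* communication, not one system's.
        return {t: n / total for t, n in topics.items()}

    # Gmail alone says the relationship is 100% about Java; adding the AIM
    # logs tells a very different story (~2% Java).
    print(merge_sources([
        {"system": "gmail", "count": 50,   "topics": {"java": 50}},
        {"system": "aim",   "count": 2500, "topics": {"late-night wackiness": 2500}},
    ]))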

Proposition 5

This will require an ingenious entity to manage the graph. The government's attempt to do more or less exactly what I'm talking about is likely to be kept far from us law-abiding citizens. (I consider it only a matter of time before the world's top cybercrime / warfare / terrorism groups actually compromise this database, but they're not about to give us an API to it either.)

It won't be practical to keep this data secret; users will need to understand that once they allow data to flow into the system, it will be syndicated and replicated forever; it can't be pulled back. This should not shock current users of social networks, who must assume that not only their friends list/network, but the history of its deltas over time, and any tracking-cookie-enabled assumptions about the people in it, may already be in a Google cache somewhere or in someone's data scraping startup.

What we can hope for is some centralized mechanism for auditing, correcting, and marking things as questionable. I.e., some group of graph engines that have a higher degree of trust than generic web scraping. Incorrect data could be challenged by allowing the engines access to additional information.
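
One hypothetical shape for such an assertion, sketched in Python (the field names and trust levels are invented for illustration):

    # Each piece of edge data carries its provenance and a trust level, and
    # can be flagged as disputed rather than silently trusted like a scrape.

    assertion = {
        "edge": ("me", "old-buddy"),
        "claim": "worked together",
        "source": "graph-engine-A",     # assumed name for a trusted engine
        "trust": "high",                # trusted engine vs. generic scraping
        "status": "asserted",           # asserted | disputed | corrected
    }

    def dispute(assertion, evidence):
        """Challenge a claim; the engine re-evaluates it using the additional
        information the user chooses to expose."""
        assertion["status"] = "disputed"
        assertion["pending_evidence"] = evidence
        return assertion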

Proposition 6

Automation in turn gives rise to graph spam and graph phishing: if a med vendor sends me 100 emails a day, does that imply something? If a foreign con-man tricks me into clicking his link or sending him an email, does that get him mileage in my social graph?
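
One plausible countermeasure, sketched in hypothetical Python: weight inferred edges by reciprocity, so one-way traffic earns almost nothing:

    # 100 unanswered emails a day from a med vendor (or one click on a
    # con-man's link) should not buy a strong edge in my graph.

    def edge_weight(sent_by_me, sent_by_them):
        total = sent_by_me + sent_by_them
        if total == 0:
            return 0.0
        reciprocity = min(sent_by_me, sent_by_them) / max(sent_by_me, sent_by_them)
        return reciprocity * min(1.0, total / 100.0)

    print(edge_weight(sent_by_me=0, sent_by_them=700))   # spammer: 0.0
    print(edge_weight(sent_by_me=40, sent_by_them=55))   # real friend: ~0.69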

One way or another, this stuff is coming, so we may as well start figuring it out right now.
