Disambiguation

May 30th, 2010, By talk

Which is Which?

As you probably know by now, Headup specializes in understanding the meaning of text in web pages, extracting the important objects, and bringing complementary content about them from around the web.  One of the most important requirements in carrying out this process is correctly identifying the meaning of words on the page, especially those that have more than one meaning.  If you fail to do this, you can fetch a lot of high-quality, real-time and personalized information – about the wrong topic…

In technical jargon, identifying the correct meaning of words is called “Word Sense Disambiguation”.  Or in the words of Led Zeppelin (from “Stairway to Heaven”):

There’s a sign on the wall

But she wants to be sure

‘Cause you know sometimes words

Have two meanings

Some words have different meanings depending on their context.  For example, the word “Apple” can mean the fruit, the technology company, the record label or the band.  “John Mack” can refer to the Chairman of Morgan Stanley, the musician, or the psychiatrist who specialized in alien abduction experiences. The term “Enterprise” can mean a company, a city, a ship, a starship, a space shuttle, and much more.

On the other hand, the same term can appear in the text in many different formats.  For example, if you write a blog post that mentions Barack Obama, you might refer to him as Barack Obama, President Obama, the President, the U.S. President, President of the U.S.A., Barack, Obama, Mr. Obama, etc.  All of these terms refer to the same person, so an automated system seeking to understand the text should resolve all of them to the same entity.
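A minimal sketch of this idea, assuming a simple hand-built alias table (the table, the `resolve` function and the entity ID `Barack_Obama` are illustrative names, not Headup's actual data or API):

```python
# Hypothetical alias table: every known surface form maps to one canonical entity ID.
ALIASES = {
    "barack obama": "Barack_Obama",
    "president obama": "Barack_Obama",
    "mr. obama": "Barack_Obama",
    "obama": "Barack_Obama",
    "barack": "Barack_Obama",
}


def resolve(mention):
    """Return the canonical entity ID for a mention, or None if it is unknown."""
    key = " ".join(mention.lower().split())  # normalize case and whitespace
    return ALIASES.get(key)


mentions = ["Barack Obama", "President Obama", "Mr.  Obama", "Obama"]
entities = {resolve(m) for m in mentions}  # all four collapse to a single entity
```

A static table like this cannot handle forms such as “the President”, whose referent depends on the surrounding text; those need the context-based disambiguation discussed next.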

There are various approaches to word sense disambiguation.  Some rely on the statistics of surrounding words; others require a training stage utilizing large pieces of text in which the meaning of words has been marked manually.  Headup’s approach to disambiguation is based on its knowledge graph, the ever-expanding collection of topics, attributes, and semantic relationships between them.  Combining information derived from the knowledge graph with analysis of syntax (the way words are combined into sentences) enables Headup to reach a very high rate of precision (the percentage of terms that are correctly identified) in its disambiguation process.
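To make the graph-based idea concrete, here is a toy sketch. It is not Headup's algorithm: it assumes each candidate meaning has a set of related topics in the graph, and simply picks the candidate that shares the most related topics with the other terms found on the page. The `GRAPH` contents and the `disambiguate` function are invented for illustration.

```python
# Toy knowledge graph: each candidate meaning lists related topics (invented data).
GRAPH = {
    "Apple (company)": {"iphone", "steve jobs", "mac"},
    "Apple (fruit)": {"orchard", "pie", "cider"},
    "Apple Records": {"the beatles", "record label", "abbey road"},
}


def disambiguate(candidates, context_terms):
    """Pick the candidate whose graph neighbors overlap most with the page context."""
    context = {term.lower() for term in context_terms}
    scores = {c: len(GRAPH[c] & context) for c in candidates}
    best = max(scores, key=scores.get)
    # With no overlap at all there is no evidence, so report "unknown".
    return best if scores[best] > 0 else None


choice = disambiguate(list(GRAPH), ["Steve Jobs", "iPhone", "keynote"])
```

A real system would also weigh syntax and relationship types, as described above; this sketch only counts shared neighbors.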

The examples below show how Headup can correctly identify terms that appear in plain text, even when the term has more than one meaning, or appears only partially in the text.  All of the examples are based on actual posts in blogs that are using Headup.

First, let’s take a look at an example from the film blog “HeyUGuys”.  Here you can see how Headup correctly identifies the word “Abrams” as referring to the writer and producer J. J. Abrams.

Disambiguation Image

And here’s another example: It is typical in gossip media to refer to celebrities using their first name only, to induce a sense of familiarity.  In the post below, taken from the blog “HitDanBack”, singer Mariah Carey is referred to only as “Mariah”.  This doesn’t stop Headup from correctly identifying her, based on understanding the topic of the blog post and the context in which the name appears.

Disambiguation Image

And in the final example below, you can see how Headup interprets the term “European Championship” in an article from the blog Jewlicious.  This term can mean a championship in any sport, but based on the context of the article and related terms that are identified in the text, Headup correctly interprets “European Championship” as referring to the European Figure Skating Championship.

Disambiguation Image

If you want to see more examples of Headup in action, visit www.headup.com and explore the various blogs that are already using Headup.  You can also test drive the engine for yourself in our Entity Extraction Playground.  Enjoy!


Smart Search – Finding Things in Groups

May 16th, 2010, By talk

Searching for stuff is sometimes tough.  If you know what you’re looking for, and you phrase your search term just right, then you usually get good results.  But if not, you’re in big trouble, doomed to sift endlessly through the results, page by page, until you find the thing that you were really looking for.

Search engines are good at finding terms, expressions, and pieces of text.  But that’s where their world ends: They don’t understand the meaning of the text they are searching for, and they know nothing about objects, entities or relationships.  In addition, they are not designed to find stuff in groups, but search for a single object each time.

For example, let’s say you are interested in seeing video clips of songs from the Dire Straits album “Brothers in Arms”.  If you search for “Dire Straits Brothers in Arms Album” on YouTube, you will get many links to video clips of the song “Brothers in Arms”, and some links to other songs in that album (if the album name appears in the clip description).  If you are lucky, you’ll get a link to a playlist called “Dire Straits Brother in Arms Album” prepared by some user who manually searched for these tracks by name.

YouTube search results for "Dire Straits Brothers in Arms Album"

But now look what happens if you execute the same query through Headup: Headup automatically digs into its database to find the tracks in the album, and searches for specific video clips of these tracks.  Then, it returns a nice “video wall” where each thumbnail links to a different track in the “Brothers in Arms” album.  The key here is that Headup “knows” what an album is, associates it with its tracks, and is smart enough to understand that YouTube hosts mainly videos of tracks, not full albums.  This type of reasoning and “smart search” implementation is way beyond the power of other “topic search” engines that do nothing more than search forwarding.
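The expansion step described above can be sketched roughly as follows. This is an assumption-laden illustration: the `ALBUMS` dictionary stands in for the knowledge graph, the track list is deliberately partial, and `video_queries` is a made-up helper rather than part of Headup.

```python
# Hypothetical slice of a knowledge graph: album -> artist and tracks.
ALBUMS = {
    "Brothers in Arms": {
        "artist": "Dire Straits",
        # Partial track list, enough for illustration.
        "tracks": ["So Far Away", "Money for Nothing", "Walk of Life"],
    },
}


def video_queries(album_title):
    """Expand an album into one video search query per track."""
    album = ALBUMS[album_title]
    return [f'{album["artist"]} {track}' for track in album["tracks"]]


queries = video_queries("Brothers in Arms")
```

Each resulting query targets a single track, which matches how clips are usually titled on a video site, instead of sending the album name as one literal search string.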

Headup video results for "Dire Straits Brothers in Arms Album"

Let’s take another example.  What if you are searching for a certain type of product by a certain brand, such as Samsung LED-backlit LCD TVs, or Sony Flash-based HD camcorders?  If you try these search terms in a regular search engine, you will get scattered results of news announcements, product reviews, and maybe a link to a specific product page.  But you’ll never get a list of actual TVs or camcorders that match these criteria, since search engines can only search for the text you supplied, but don’t understand it.

When such a search is conducted through Headup, it queries its knowledge graph for items that match the requested criteria.  Since in Headup objects have meaning, properties and relations to other objects, it is quite easy to go through all the “Products” by the “Company” Sony, find the “Camcorder” type products, and filter only those items that have “Memory Type” equal to “Flash”, and “Resolution” equal to “HD”.  So executing such a query through Headup may result, for example, in a neat list of links to specific product pages, which may include media reviews, user reviews and price comparison with purchasing links.
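As a rough sketch of that kind of attribute filtering (the `Node` class, the product names and their attribute values are all placeholders invented for this example, not real catalog data or Headup's schema):

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph object with a type and a dictionary of attributes."""
    name: str
    type: str
    attrs: dict = field(default_factory=dict)


# Invented product nodes for illustration only.
NODES = [
    Node("Camcorder A", "Camcorder", {"Company": "Sony", "Memory Type": "Flash", "Resolution": "HD"}),
    Node("Camcorder B", "Camcorder", {"Company": "Sony", "Memory Type": "Tape", "Resolution": "HD"}),
    Node("Camcorder C", "Camcorder", {"Company": "Samsung", "Memory Type": "Flash", "Resolution": "HD"}),
    Node("TV D", "TV", {"Company": "Sony", "Resolution": "HD"}),
]


def find(nodes, node_type, criteria):
    """Return names of nodes of the given type whose attributes match every criterion."""
    return [
        n.name
        for n in nodes
        if n.type == node_type
        and all(n.attrs.get(key) == value for key, value in criteria.items())
    ]


matches = find(NODES, "Camcorder", {"Company": "Sony", "Memory Type": "Flash", "Resolution": "HD"})
```

The point is that the query runs over typed objects and their properties, so the result is a list of products and not a list of pages that happen to contain the words.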

Note that even though Headup currently does not support direct search, the “smart search” method is already implemented in the current pop-up widget and topic pages.  When you look at images, news or videos of a certain object or topic, Headup’s “smart search” works behind the scenes to bring you the most relevant content for that object, by understanding and utilizing its relationship to other objects.


Facebook Likes – How big a deal is this?

April 25th, 2010, By talk

Facebook TouchGraph via: http://www.flickr.com/photos/jurvetson/3346659199

Mark Zuckerberg’s dramatic announcement at last week’s f8 signifies that Facebook has decided to take upon itself a responsibility for enabling and encouraging users to expand their dialog outside the scope of Facebook and on the web at large. It’s an ambitious project aiming to add a personal and social aspect to every website capable of adding Facebook’s buttons.

Facebook planned the launch carefully, and at the outset the feature is already available on many top online destinations, including CNN, ESPN, IMDb, and others. Backed by the knowledge that Likes are supported by such a team of powerhouses, Zuckerberg ventured a prediction that Facebook Likes would cross the 1 billion mark within 24 hours of launch.

It’s interesting to note that at the most basic level there’s little that is new about the new feature. After all, people have been “Liking” each other’s Facebook posts, some of which include links to external websites, from the very start.  In fact, there’s a whole industry of services and plugins aimed at doing exactly what Facebook has now made generally accessible itself. One might well ask, “What’s all the excitement about?”

The excitement is justified in this case for a number of reasons:

  1. The scope of the move – Facebook’s announcement affects 400 million users on Facebook alone, and that is before we count the millions of users on the partner sites mentioned above who aren’t Facebook users…
  2. The scope of data Facebook will own – For me, the most significant fact that the announcement highlights is that Facebook will greatly increase the already unrivaled data set it owns regarding each and every one of its users. Facebook will now know not only whatever these users shared about themselves explicitly and via their social data, but will also have the ability to couple this data to a user’s “likes”, or in other words, couple the personal and social data to behavioral data reflecting the webpages a user has acknowledged preferring.

Why is this a big deal?

Other than the fact that Facebook, a private company, will now know more about each and every one of its users than any company has ever known before (shudder), there are deeper implications for the web of this new functionality that we’ll probably begin to see faster than we can imagine. Facebook-powered sites have, at least in theory, the ability to provide a tailored individual experience for every Facebook user visiting them. They might show the user which of his/her friends reacted to the content, and how, or they might suggest “Smart sharing”, offering those friends most likely to be interested in this particular type of content. The full extent of the future functionality the move enables is difficult to predict, but there can be little doubt that the web is about to undergo a pretty significant change.


How to Fit the Whole Web in a Small Box

April 4th, 2010, By drorg

I remember that day clearly.  As I entered SemantiNet’s offices, Sagie (our director of R&D) approached me holding a thin black box in his hand.

“We did it!” he said.

“Did what?” I asked, looking at the black box which resembled a hard disk drive on a diet.

“We managed to fit all of our Knowledge Graph on a 20GB Solid State Drive.  This small box holds all of the information in Wikipedia and dozens of other web sources”.

“Wow”, I said, “Unbelievable!  They say the world is getting smaller, but I never imagined that the web is getting smaller, too…”

To understand the significance of this achievement, you need to realize that Headup stores a lot of information.  And I mean a lot.  Headup knows about more than 100 million topics, spanning diverse fields from movies to microchips, and religion to rollerblades.  Each topic has several attributes, and the topics are connected to each other through semantic relationships.  For example, a band is connected to its albums, each album is connected to its tracks, a company is connected to its products, etc.
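One common way to picture such relationships is as (subject, relation, object) triples. The sketch below assumes that representation purely for illustration; the relation names, the sample triples and the `follow` helper are invented, and Headup's internal format is not described in this post.

```python
# Invented triples mirroring the relationships named above.
TRIPLES = [
    ("Dire Straits", "has_album", "Brothers in Arms"),
    ("Brothers in Arms", "has_track", "Walk of Life"),
    ("Brothers in Arms", "has_track", "So Far Away"),
    ("Sony", "makes_product", "Camcorder A"),  # placeholder product name
]

# Index the triples by (subject, relation) so each hop is a dictionary lookup.
index = {}
for subj, rel, obj in TRIPLES:
    index.setdefault((subj, rel), []).append(obj)


def follow(start, *relations):
    """Walk a chain of relations from a start node and return the nodes reached."""
    frontier = [start]
    for rel in relations:
        frontier = [obj for node in frontier for obj in index.get((node, rel), [])]
    return frontier


tracks = follow("Dire Straits", "has_album", "has_track")
```

Following two relations in a row (band to albums, albums to tracks) is the kind of traversal that a relationship-oriented store needs to make cheap.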

To get a better grasp on this, let’s look at a very small piece of the Headup knowledge graph described in the diagram below.  This piece of the graph is connected to the actress Angelina Jolie, one of the entities or objects that Headup knows about.  It includes the different pieces of information that Headup knows about her, gathered from various web sources.  General information about Angelina Jolie, such as her birth date, husband, and city of birth, is taken from Wikipedia.  The movies she appeared in, such as “Wanted”, “Beowulf” etc., are taken from IMDb.  Ratings for those movies are taken from Rotten Tomatoes.  Information about people who like each of these movies is taken from personal preferences that are exposed on social networks such as Facebook and MySpace.


Now imagine that Headup needs to store such detailed information about each one of millions of topics that appear in Wikipedia and in numerous other data sources, and constantly manage and update all the different relationships between them.  Currently we have over 300 million nodes (objects and attributes) in our graph, with over 2 billion connections.

To store this data and enable scalable, reliable and efficient access to it, we needed a highly-optimized data store, with a small footprint and super-fast performance.  Traditional off-the-shelf databases such as MySQL and Oracle are optimized for documents and transactions, but not for efficiently traversing relationships and properties of objects.  Recently, several dedicated Graph Databases (also called “Triplestores”) have become available, which are more optimized for semantic web applications.  However, as we kept adding more and more knowledge to Headup, we came to the conclusion that none of these off-the-shelf solutions was suitable for Headup.  So we had no choice but to develop our own data store that is optimized for our needs.

It turned out that by creating a unique data store and optimizing it for the structure of our knowledge graph, we were able to achieve an order of magnitude boost in performance over existing solutions, both for building the graph and for accessing it.  Our data store can currently support up to 1 billion nodes (topics and attributes), and dozens of billions of edges (relationships between topics and attributes), so we still have plenty of room to grow.

Building such a huge graph in itself is a far from trivial task, since it requires processing amounts of data which cannot be stored in the computer’s main memory (RAM).  Hard drives are also not a good choice due to their limited access speeds and data transfer rates.  Therefore, we approached this challenge by utilizing the Hadoop software framework (inspired by Google’s MapReduce), which supports huge-scale, data-intensive distributed applications.  Using Hadoop enables us to build the graph in a completely distributed manner, so we can easily deal with this vast amount of data.

The raw data of our current graph spans about 500 Gigabytes.  Using numerous optimization techniques, we managed to compress this amount of data to just 15 GB, meaning that we can hold the whole graph in RAM. As we add more sources, the graph grows to a point where it is no longer cost effective to store it in RAM. For this reason, the graph can be easily stored on a commodity SSD, which costs around $100. Furthermore, the graph is designed for optimal utilization of the drive’s internal cache, block sizes and the fast random access.
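For a sense of scale, here is a back-of-the-envelope calculation using only the figures quoted in this post. It assumes decimal gigabytes and assumes the 15 GB figure refers to the same graph as the 300 million nodes and 2 billion connections mentioned earlier; the per-node and per-connection numbers are crude averages, not a description of the actual storage layout.

```python
# Figures quoted in the post (decimal gigabytes assumed).
raw_bytes = 500e9          # raw data: about 500 GB
compressed_bytes = 15e9    # compressed graph: about 15 GB
nodes = 300e6              # over 300 million nodes
connections = 2e9          # over 2 billion connections

compression_ratio = raw_bytes / compressed_bytes      # roughly 33x smaller
bytes_per_node = compressed_bytes / nodes             # average budget per node
bytes_per_connection = compressed_bytes / connections # average budget per connection

print(f"compression ratio: about {compression_ratio:.1f}x")
print(f"average bytes per node: {bytes_per_node:.0f}")
print(f"average bytes per connection: {bytes_per_connection:.1f}")
```

Under these assumptions the whole graph averages out to only a few bytes per connection.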

The good news is that compressing the graph is done without compromising performance.  In fact, using a compressed graph actually increases its performance, since much less data has to be accessed and processed. Using our graph data store, we can find every piece of information in about 10 milliseconds using a hard disk drive, 0.1 milliseconds using a solid state drive, and just 0.1 microseconds when the information is in RAM.
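To put the quoted latencies side by side, the short calculation below converts each one into sequential lookups per second. It assumes one lookup at a time with no caching or parallelism, which is a simplification made only for comparison.

```python
# Per-lookup latencies quoted above, in seconds.
LATENCY_SECONDS = {
    "HDD": 10e-3,   # 10 milliseconds
    "SSD": 0.1e-3,  # 0.1 milliseconds
    "RAM": 0.1e-6,  # 0.1 microseconds
}

# Sequential lookups per second for each medium (rounded to whole lookups).
lookups_per_second = {medium: round(1 / t) for medium, t in LATENCY_SECONDS.items()}

# Relative speedups between adjacent tiers.
ssd_vs_hdd = round(LATENCY_SECONDS["HDD"] / LATENCY_SECONDS["SSD"])
ram_vs_ssd = round(LATENCY_SECONDS["SSD"] / LATENCY_SECONDS["RAM"])
```

On these figures an SSD is about 100 times faster per lookup than a hard disk, and RAM is about 1,000 times faster than an SSD.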

To demonstrate the performance of Headup’s graph database, let’s look at a typical Headup-powered web page which contains 50 terms and 700 candidates for disambiguating them.  Using our unique data store, such a page can be processed in only 2 seconds using an HDD, and about 200 milliseconds using an SSD.  With such performance, we can easily support sites with millions of unique page views without overloading their computing resources.

“Can I have the graph for one night?” I asked Sagie.

“What do you need it for?  Do you have anything to add to this immense knowledge repository?” Sagie was totally surprised.

“Not really”, I said.  “I want to put it under my pillow when I sleep, and hopefully all the Angelina Jolie stuff you showed me will inspire my dreams…”
