Topsy deploys v2 platform to index 100 billion status updates
We’ve had a fun week over here at Topsy. We finished the roll-out of version 2.0 of our platform, a major engineering and operational milestone for us.
Topsy is now the largest searchable index of content posted on Twitter – we recently indexed our 5 billionth tweet and 2.5 billionth link. Unlike most retrieval systems, Topsy organizes its search index in real time while still maintaining a long-term history. Our v2 architecture takes our search approach to a new level of scale – it is designed to index over 100 billion status updates and related objects from any social network.
Under the hood
For search and operations geeks out there, here’s a peek into how Topsy’s v2 index architecture works. Our platform spans a cluster of over 500 servers and roughly a petabyte of storage. We receive tweets from Twitter through the Firehose (and other sources like search.twitter.com and Twitter’s REST API). Each tweet is written to our distributed queuing store (called the Swarm). Swarm provides a mechanism for hundreds of tasks to process a status update within milliseconds of arrival to turn it into indexable data. A typical pipeline to process a tweet looks like this:
- Metadata for the tweet, including user information (such as language, location, user-ID, profile photo), is extracted.
- All links present in the tweet are extracted, short URLs are expanded and URL redirects resolved, and each link is visited to fetch titles and descriptions and to generate thumbnails for images.
- Relationships to other tweets or Twitter users are extracted, and a link is made and verified. (For example, Topsy tracks replies to tweets, and finds and verifies original tweets for organic retweets by computing a similarity score for the original and retweeted text.)
- A citation is created from the tweet, associating the user and the tweet text to links in the tweet (or to the original tweet if it’s a retweet).
- The text of the tweet is parsed and tokenized to make it easier to search regardless of the language of the text. We perform linguistic analysis, which makes it possible to search Topsy for Japanese and Simplified/Traditional Chinese text, among other languages.
- The tweet is loaded into the search index. Topsy has a unique citation index model, in which the text of a status update – treated as a citation – remains associated with its author and with the links it contains. This allows the search engine to rank tweets, photos, and websites by relevance, social citation scores, and time.
- The tweet is loaded into Topsy’s in-memory Grapher, so that it shows up on Topsy Author pages and Topsy Trackback pages.
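To make the link-extraction, retweet-verification, and citation steps above more concrete, here is a minimal Python sketch. It is not Topsy's code: the `Citation` class, the function names, the regular expressions, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity score are all assumptions chosen for illustration. URL expansion, metadata fetching, and tokenization are left out.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical patterns; a production system would be far more careful.
URL_RE = re.compile(r"https?://\S+")
RT_RE = re.compile(r"^\s*RT\s+@(\w+):?\s*(.*)$", re.DOTALL)


@dataclass
class Citation:
    """Associates an author and a tweet's text with what the tweet cites."""
    author: str
    text: str
    targets: list  # links, or a reference to the original tweet


def extract_links(text):
    """Return URLs found in the text, with trailing punctuation trimmed."""
    return [u.rstrip(".,;:!?)") for u in URL_RE.findall(text)]


def similarity(a, b):
    """Similarity score in [0, 1]; SequenceMatcher is a stand-in here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_original(text, candidates, threshold=0.8):
    """Find the tweet an organic retweet ("RT @user: ...") most likely copies.

    candidates maps tweet_id -> (author, text). Returns a tweet_id, or None
    if the text is not a retweet or no candidate is similar enough.
    """
    match = RT_RE.match(text)
    if not match:
        return None
    rt_author, rt_body = match.group(1), match.group(2)
    best_id, best_score = None, 0.0
    for tweet_id, (author, original_text) in candidates.items():
        if author.lower() != rt_author.lower():
            continue
        score = similarity(rt_body, original_text)
        if score > best_score:
            best_id, best_score = tweet_id, score
    return best_id if best_score >= threshold else None


def make_citation(author, text, candidates):
    """Cite the original tweet for a verified retweet, else the tweet's links."""
    original = find_original(text, candidates)
    if original is not None:
        return Citation(author, text, [f"tweet:{original}"])
    return Citation(author, text, extract_links(text))
```

With a candidate set containing @topsy's original tweet, calling `make_citation("r2g2", "RT @topsy: ...", candidates)` on a faithful retweet yields a citation that points at the original tweet, while a plain tweet yields a citation that points at its links.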
This pipeline runs tens of thousands of times a second and generates a 10x fanout (each tweet results in 10 or more pieces of data). We add close to 500 million pieces of data to our index every day.
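As a rough back-of-the-envelope check, the daily figure can be turned into average per-second rates. The sketch below uses only the two numbers stated above (about 500 million pieces of data per day and a fanout of at least 10); the variable names are ours, and the result is an average that says nothing about peaks.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

pieces_per_day = 500_000_000  # "close to 500 million pieces of data" per day
fanout = 10                   # "each tweet results in 10 or more pieces of data"

# Average rate at which pieces of data are added to the index.
pieces_per_second = pieces_per_day / SECONDS_PER_DAY

# Because the fanout is "10 or more", these are upper bounds on tweet volume.
tweets_per_day = pieces_per_day / fanout
tweets_per_second = tweets_per_day / SECONDS_PER_DAY

print(f"{pieces_per_second:,.0f} pieces of data per second on average")
print(f"at most {tweets_per_day:,.0f} tweets per day")
print(f"at most {tweets_per_second:,.0f} tweets per second on average")
```

That works out to an average of roughly 5,800 pieces of data per second, from at most about 50 million tweets per day. How this average relates to the "tens of thousands of times a second" figure is not stated here; one possible reading, which is only an assumption, is that the larger number counts individual task executions or peak rates.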
Four Demons of Search
Traditional search architectures are designed to process large volumes of data in batch processes. The large-batch approach doesn’t work for processing and indexing real-time streams.
Search engines are constantly battling what we call the Four Demons of Search: query speed, relevance, frequency of index updates and recall (indexing historical data). Traditional web search engines compromise on update frequency; real-time search engines typically compromise on recall (historical data) and relevance. At Topsy, we’ve architected our search platform to scale up in all four dimensions using modern techniques for indexing and run-time query processing.
This is an exciting new direction in information retrieval, especially as large amounts of real-time data are being created and become available on the web. In a future post, we’ll cover the design choices that have resulted in our approach and architecture.
28 Responses to “Topsy deploys v2 platform to index 100 billion status updates”
August 24th, 2010 at 5:43 pm
RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi
August 24th, 2010 at 5:50 pm
Topsy under the hood !!!RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi
August 24th, 2010 at 5:55 pm
Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi
August 24th, 2010 at 5:57 pm
RT @r2g2: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi Me: Twitter search engine to keep an eye on.
August 24th, 2010 at 7:42 pm
Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi
August 24th, 2010 at 7:42 pm
RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi
August 24th, 2010 at 10:11 pm
v2 of @topsy platform indexing 100 billion tweets: http://bit.ly/akQRnY Good blog post on their underlying architecture.
August 24th, 2010 at 10:30 pm
RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi
August 25th, 2010 at 5:19 am
Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/ckmCgO
August 25th, 2010 at 10:17 am
RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi
August 25th, 2010 at 12:49 pm
Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY
August 25th, 2010 at 12:55 pm
Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY thnks@justincutroni
August 25th, 2010 at 1:15 pm
RT @justincutroni: Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY
August 25th, 2010 at 1:43 pm
RT @justincutroni: Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY
August 25th, 2010 at 2:44 pm
Want to search Twitter? Use topsy! They have indexed 5 billion tweets. http://bit.ly/akQRnY
August 25th, 2010 at 2:55 pm
@markhuot I haven't tried it, but apparently Topsy http://bit.ly/akQRnY has indexed 5bn tweets (!)
August 25th, 2010 at 3:02 pm
Want to search Twitter? Use topsy! They have indexed 5 billion tweets. http://bit.ly/akQRnY #socialmedia
August 25th, 2010 at 4:51 pm
RT @justincutroni: Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY
August 26th, 2010 at 5:41 pm
RT @justincutroni Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY
August 27th, 2010 at 5:47 pm
RT @mcbuzz: RT @justincutroni Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY
August 27th, 2010 at 7:39 pm
Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/am1usr
September 7th, 2010 at 6:26 pm
RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi
September 27th, 2010 at 4:24 pm
Topsy deploys v2 platform to index 100 billion status updates http://goo.gl/fb/Cbis5
September 27th, 2010 at 4:30 pm
Topsy deploys v2 platform to index 100 billion status updates: http://bit.ly/aLyjYK Comments: http://bit.ly/dDms3i
September 27th, 2010 at 4:41 pm
Topsy deploys v2 platform to index 100 billion status updates: Comments http://bit.ly/ceWydN
September 27th, 2010 at 4:56 pm
Topsy deploys v2 platform to index 100 billion status updates – http://su.pr/16Z3Z0
September 27th, 2010 at 5:48 pm
Very interested to hear about @topsy's new platform, http://bit.ly/8Zodpo I like the talk about battling the four demons of search…
September 28th, 2010 at 12:43 am
Topsy deploys platform to index 100 billion status updates on Twitter: http://is.gd/fwAeA #search #wow #guy