Topsy deploys v2 platform to index 100 billion status updates

{topsy_retweet_button} We’ve had a fun week over here at Topsy. We finished the roll-out of version 2.0 of our platform, a major engineering and operational milestone for us.

Topsy is now the largest searchable index of content posted on Twitter – we recently indexed our 5 billionth tweet and 2.5 billionth link. Unlike most retrieval systems, Topsy organizes its search index in real-time, while still maintaining a long-term history. Our v2 architecture takes our search approach to a new level of scale – it is designed to index over 100 billion status updates and related objects, from any social network.

Under the hood

For search and operations geeks out there, here’s a peek into how Topsy’s v2 index architecture works. Our platform spans a cluster of over 500 servers and roughly a petabyte of storage. We receive tweets from Twitter through the Firehose (and other sources like search.twitter.com and Twitter’s REST API). Each tweet is written to our distributed queuing store (called the Swarm). Swarm provides a mechanism for hundreds of tasks to process a status update within milliseconds of arrival to turn it into indexable data. A typical pipeline to process a tweet looks like this:

  1. Metadata for the tweet, including user information (such as language, location, user-ID, profile photo) is extracted.
  2. All links present in the tweet are extracted, short URLs are expanded and URL redirects resolved, and the link is visited to fetch titles, descriptions, and to generate thumbnails for images.
  3. Relationships to other tweets or Twitter users are extracted and a link is made and verified. (For example, Topsy tracks replies to tweets, and finds and verifies original tweets for organic retweets by computing a similarity score for the original and retweeted text)
  4. A citation is created from the tweet, associating the user and the tweet text to links in the tweet (or to the original tweet if it’s a retweet).
  5. The text of the tweet is parsed and tokenized to make it easier to search regardless of the language of the text. We do linguistic analysis making it possible to search Topsy for Japanese and Simplified/Traditional Chinese text, among other languages.
  6. The tweet is loaded into the search index; Topsy has a unique citation index model, in which a text of a status update – treated as a citation – remains associated with its author and to links it contains. This allows the search engine to rank tweets, photos, and websites by relevance, social citation scores, and time.
  7. The tweet is loaded into Topsy’s in-memory Grapher, so that it shows up on Topsy Author pages and Topsy Trackback pages.

This pipeline runs tens of thousands of times a second and generates a 10x fanout (each tweet results in 10 or more pieces of data). We add close to 500 million pieces of data to our index every day.

Four Demons of Search

Traditional search architectures are designed to process large volumes data in batch processes. The large-batch approach doesn’t work for processing and indexing on real-time streams.

Search engines are constantly battling what we call the Four Demons of Search: query speed, relevance, frequency of index updates and recall (indexing historical data). Traditional web search engines compromise on update frequency; real-time search engines typically compromise on recall (historical data) and relevance. At Topsy, we’ve architected our search platform to scale up in all four dimensions using modern techniques for indexing and run-time query processing.

This is an exciting new direction in information retrieval, especially as large amounts of real-time data are being created and become available on the web. In a future post, we’ll cover the design choices that have resulted in our approach and architecture.

28 Responses to “Topsy deploys v2 platform to index 100 billion status updates”

  1. Ted Cui Says:

    RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi

  2. Amruta Moktali Says:

    Topsy under the hood !!!RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi

  3. Rishab Aiyer Ghosh Says:

    Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi

  4. sjvn Says:

    RT @r2g2: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi Me: Twitter search engine to keep an eye on.

  5. Mike More Says:

    Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi

  6. Larry Osborn Says:

    RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi

  7. Developers @oneforty Says:

    v2 of @topsy platform indexing 100 billion tweets: http://bit.ly/akQRnY Good blog post on their underlying architecture.

  8. Dan Brumleve Says:

    RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi

  9. SmediaC Says:

    Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/ckmCgO

  10. Eldon Edwards Says:

    RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi

  11. Justin Cutroni Says:

    Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY

  12. Bibi Mukherjee Says:

    Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY thnks@justincutroni

  13. Dave Reichenbacher Says:

    RT @justincutroni: Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY

  14. Даниил Азовских Says:

    RT @justincutroni: Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY

  15. spalburg Says:

    Twitter doorzoeken? Gebruik topsy! Zij hebben 5 miljard tweets geïndexeerd. http://bit.ly/akQRnY

  16. lukestevens Says:

    @markhuot I haven't tried it, but apparently Topsy http://bit.ly/akQRnY has indexed 5bn tweets (!)

  17. spalburg Says:

    Twitter doorzoeken? Gebruik topsy! Zij hebben 5 miljard tweets geïndexeerd. http://bit.ly/akQRnY #socialmedia

  18. Billy Girlardo Says:

    RT @justincutroni: Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY

  19. Mark McLaren Says:

    RT @justincutroni Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY

  20. Amelia Taylor Says:

    RT @mcbuzz: RT @justincutroni Need to search Twitter? Use Topsy! They've indexed 5 billion tweets back to May 08. http://bit.ly/akQRnY

  21. Gary Thomas Says:

    Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/am1usr

  22. Moose Harrington VI Says:

    RT @topsy: Topsy deploys v2 platform to index 100 billion status updates http://bit.ly/dxvzEi

  23. Hacker News YC Says:

    Topsy deploys v2 platform to index 100 billion status updates http://goo.gl/fb/Cbis5

  24. Hacker News Says:

    Topsy deploys v2 platform to index 100 billion status updates: http://bit.ly/aLyjYK Comments: http://bit.ly/dDms3i

  25. pok rap Says:

    Topsy deploys v2 platform to index 100 billion status updates: Comments http://bit.ly/ceWydN

  26. eBot Says:

    Topsy deploys v2 platform to index 100 billion status updates – http://su.pr/16Z3Z0

  27. Josh Lucas Says:

    Very interested to hear about @topsy's new platform, http://bit.ly/8Zodpo I like the talk about battling the four demons of search…

  28. Loggly, Inc. Says:

    Topsy deploys platform to index 100 billion status updates on Twitter: http://is.gd/fwAeA #search #wow #guy

Leave a Reply

You must be logged in to post a comment.