Skype Explains the Big December Failure
Posted on December 29th, 2010

On December 22nd, millions of Skype users were left without service for roughly 24 hours. At the time, Skype was scrambling to get the system back up and didn’t offer a full explanation of what was happening.

Earlier today, Skype’s CIO, Lars Rabbe, gave a fairly detailed explanation of the system-wide failure. Skype depends heavily on a worldwide network of peer-to-peer nodes and supernodes that are hosted by users running Skype’s software. This network distributes the service’s workload across Skype users as needed.

According to Lars, a cluster of Skype servers became overloaded and pushed more of the load onto the peer-to-peer network. Normally, this should only have slowed the network down. Instead, a bug in some of the Windows clients running a newer Skype version caused many of them to fail completely. About 50% of the peer-to-peer network stopped responding, and the entire network collapsed like a house of cards.
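To make the "house of cards" effect concrete, here is a toy model of a cascading overload. It is only a sketch under invented assumptions: the node capacities, the total load, and the even-sharing rule are made up for illustration and do not describe how Skype's network actually works.

```python
def simulate_cascade(capacities, total_load, initial_failures):
    """Return how many nodes are still alive once the cascade settles.

    Hypothetical model: the load is shared evenly among live nodes, and
    any node asked to carry more than its capacity drops out.
    """
    # The first `initial_failures` nodes crash outright (the client bug).
    alive = list(capacities[initial_failures:])
    while alive:
        per_node = total_load / len(alive)
        survivors = [c for c in alive if c >= per_node]
        if len(survivors) == len(alive):
            break  # every remaining node can carry its share
        alive = survivors  # overloaded nodes drop out, so shares grow
    return len(alive)


# 100 made-up nodes with capacities between 12 and 20 units each.
capacities = [12, 14, 16, 18, 20] * 20

print(simulate_cascade(capacities, 1000, 0))   # 100: healthy network
print(simulate_cascade(capacities, 1000, 10))  # 90: survivors absorb the load
print(simulate_cascade(capacities, 1000, 50))  # 0: total collapse
```

In this model, losing a small share of nodes is absorbed by the rest, but losing half of them pushes every survivor past its limit, and each round of dropouts makes the next one worse.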

Skype technicians responded by creating new peers on the network, called mega-supernodes, to try to restore normal traffic, but the recovery still took a long time.

So what has Skype learned? Can they prevent downtime in the future? Here’s what Lars said about the future:

… we are learning the lessons we can from this incident and reviewing our processes and procedures, looking in particular for ways in which we can detect problems more quickly to potentially avoid such outages altogether, and ways to recover the system more rapidly after a failure.

Skype has become a critical communication tool for many individuals and companies. If Skype can avoid major disasters in the future, it will remain the king of VoIP. If not, plenty of competitors are waiting to jump in as a replacement.

