More than 3.4 million North American Skype users, about 12% of those online at the time, were affected by an ISP service fumble, with reduced or no access to Skype dialtone for up to 90 minutes today. Phyber Communications reports ISPs appear to have been affected by Juniper routers on Level3 networks, including TimeWarnerCable Internet.
Fame. Because a rumor Justin Bieber tweeted his Skype name caused a swarm of new Skype downloads, new user accounts, and millions of futile IM and call attempts.
Skype’s network crashed from 23.7 million active users to 1.8 million today, more than a small number. It’s rebounding, now around 12.7 million people logged in. A spurt of software downloads came with the crash, although Skype blogs this is not necessary. Darned misconfigurations.
And yet I’m unable to see an outage in Skype’s stats showing the number of users online. I can only imagine the service interruption was short, the interruption was intermittent, it affected only a few people, or the data we’re seeing is incorrect.
Lars Rabbe, Skype’s CIO, reports today “around 50% of all Skype users globally were running the 5.0.0.152 version of Skype for Windows.” .152 is the build that crashed when another part of the network, offline instant messaging, became overloaded. So half of Skype’s users were using other versions and were not immediately affected. This slowed the onset of the outage.
Sven Tigane blogged about the importance of updating just the week before the Great Skype Supernode Crash of 2010. “At Skype, we regularly release updates to our software, and with every update, we introduce new features, improve existing ones and fix bugs. Keeping Skype up to date is the only way you can take advantage of these improvements, and is vital to making sure that Skype remains safe and secure.” Sven is programme manager for Skype for Windows and a longtime Skype employee; his views matter.
Skype’s deployed client inventory (the software we all use) is highly diverse. Millions of users are on dozens of versions of multiple operating systems with many kinds of hardware and very different network configurations. Simplifying, unifying, getting everyone to use the latest version will make life much easier, faster, and cheaper for Skype’s customer support, security, network administration and software development teams.
Realistically, no system is totally fail safe. One can only design to reduce the chances and minimize the impact. The problem with the current design and market condition is that most of the supernodes will be running the same OS and the same version of the app. At this stage, Skype can easily deploy their own supernodes, but maintain the P2P architecture as a way to handle growth, failover etc. They just have to make sure that the supernodes run diff OSs and they upgrade their app in a staggered fashion. In other words, they need to come up with an operational plan.
Aswath is saying Skype is suffering from monoculture: too much uniformity.
Like farming wheat in massive farms, you want the efficiency that comes with having your whole crop be one species. You get better deals on seed, only have to think about the nutrient needs of the one species, and you can harvest your whole crop at the same time. Following the simile: it’s cheaper to make the software, less complex so more reliable, and improvements propagate to your whole network.
Monoculture has its downsides. One pest can ruin your entire crop. We saw that problem in both of Skype’s major outages. Small code or settings adjustments in a popular version of the program spread like wheat rust. When they create problems, those spread quickly too.
Planned biodiversity is a response. In farming you introduce slightly different strains of your crop. In software and networks, you keep multiple versions going in the field. You keep multiple strains of protocols and architectures running at the same time. So problems, when they spread, hit limited populations. When some varieties are affected by a blight or an attack, and others aren’t, you know those differences can lead you to the disease or infestation vectors.
So here’s to keeping your last version of Skype around. And waiting a day or two for updates.
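The "wait a day or two" idea can be sketched as a staggered rollout, where each client is assigned to an upgrade wave so that a bad build reaches only part of the population at first. This is a minimal illustration under stated assumptions: the function names, the four-wave split, and the one-day spacing are all hypothetical, and nothing here reflects how Skype actually distributes updates.

```python
import hashlib

def rollout_wave(node_id: str, waves: int = 4) -> int:
    """Deterministically assign a node to one of `waves` upgrade cohorts.

    Hypothetical scheme: hash the node id so the assignment is stable
    across runs and roughly even across cohorts.
    """
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % waves

def eligible_for_update(node_id: str, days_since_release: int,
                        waves: int = 4, days_per_wave: int = 1) -> bool:
    """A node may install the new build once its wave has opened."""
    return days_since_release >= rollout_wave(node_id, waves) * days_per_wave

def max_exposed_fraction(days_since_release: int,
                         waves: int = 4, days_per_wave: int = 1) -> float:
    """Upper bound on the share of clients on the new build,
    assuming equally sized cohorts."""
    waves_open = min(waves, days_since_release // days_per_wave + 1)
    return waves_open / waves

# On release day only the first of four waves is open.
print(max_exposed_fraction(0))
```

With these assumed settings, a fault in a new build would be limited to about a quarter of clients on the first day instead of spreading to everyone at once.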
Hi Phil. I’d like to clear up the node to supernode relationship. Each supernode determines it’s client connection list to be full at 350 active tcp connections. This means a supernode is considered at maximum capacity at roughly ~350 client connections however they will and do handle more client connections up to a degree depending on system variables. ATM there are ~58K supernodes "active" and ~116K "idle". My data during the outage listed clients and supernodes however I truncated the dataset to begin at about ~1.5 MM nodes. The decline is linear with online supernodes and clients and represents a "live" graph of the overlay undergoing "segmentation" due to a failure in the main supernode backbone.
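As a rough cross-check of the figures in this comment, the nominal capacity of the active supernodes can be worked out directly. This is only a sketch: it assumes every active supernode sits exactly at the 350-connection mark, which the comment itself says is a soft limit, and the variable and function names are illustrative.

```python
# Figures quoted in the comment above.
ACTIVE_SUPERNODES = 58_000     # "~58K supernodes active"
NOMINAL_CLIENT_LIMIT = 350     # connections at which a supernode counts as full

def nominal_capacity(supernodes: int, limit: int = NOMINAL_CLIENT_LIMIT) -> int:
    """Client connections supported if every supernode is exactly at its limit."""
    return supernodes * limit

active_capacity = nominal_capacity(ACTIVE_SUPERNODES)
print(f"{active_capacity:,}")  # 20,300,000
```

That puts the nominal capacity of the active supernodes at about 20 million client connections, before allowing for the extra connections supernodes are said to accept beyond the limit.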
CORRECTIONS:
First, while Cain’s data is correct, I mislabeled it. These are 1.4 million nodes, not supernodes. By 3:46PM Eastern in the US, the Skype network crash had been going on for many hours. So you’re seeing the last leg of the collapse. Cain labeled the data correctly; it was my transcription error.
Second, I should have recognized the error. I was confused by two things. All the other information and videos focused on supernode behavior, so I was myopic. I completely missed out on the time zone differences so I didn’t notice that this data really fit into a different place on the event timeline.
Thanks to several folks at Skype who urged me to dig a bit deeper and to anonymous commenter Chupacabra who wrote “1.4 million supernodes? Thats rubbish!”
My response to Chupacabra:
@Chupacabra, I thought that at first too. And you may be right since nobody at Skype is talking. So let’s do some back of the napkin calculations. In the olden days, say 2005 or so, it was said that a supernode could support between 500 and 1000 active nodes easily. On that fateful day last week, there were about 25 million accounts online when things began to go wrong. Very few people have more than one copy of Skype running, so let’s leave that to rounding error for the moment and say there were 25 million nodes running. So 1000 goes into 25,000,000 twenty-five thousand times. 50K supernodes if we say each supports only 500 nodes. 25 to 50K feel right to me.
Are we dividing into the right population? Skype has about 150 million active users over the course of a few weeks, about six times peak levels. So that would make the number of supernodes about 150K to 300K. Still far short of 1400K.
1.4 million supernodes is not what we’d expect.
So,
- Has the math changed? If Cain’s observations were correct, does a supernode now support, say, just 20 other nodes? That would be a massive drop in efficiency.
- Did Cain observe something else, not a supernode but, perhaps, a node that was capable of becoming a supernode?
- Did Cain only find nodes – not supernodes – that were visible from his vantage point in the network? A large but limited subset?
- Did Cain make a fundamental mistake in data collection and processing?
- Did I report it wrong?
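The back-of-the-napkin figures above can be reproduced in a few lines. This is a sketch using only the numbers quoted in this post (500 to 1000 nodes per supernode, 25 million online, 150 million active over a few weeks, 1.4 million reported); the function and variable names are illustrative.

```python
ONLINE_AT_CRASH = 25_000_000       # accounts online when things went wrong
ACTIVE_OVER_WEEKS = 150_000_000    # active users over the course of a few weeks
REPORTED_SUPERNODES = 1_400_000    # the figure under dispute

def supernodes_needed(nodes: int, nodes_per_supernode: int) -> int:
    """Supernodes required if each one serves a fixed number of nodes."""
    return -(-nodes // nodes_per_supernode)  # ceiling division

def estimate_range(nodes: int) -> tuple[int, int]:
    """Low and high estimates at 1000 and 500 nodes per supernode."""
    return supernodes_needed(nodes, 1000), supernodes_needed(nodes, 500)

online_range = estimate_range(ONLINE_AT_CRASH)     # (25000, 50000)
active_range = estimate_range(ACTIVE_OVER_WEEKS)   # (150000, 300000)

# Nodes per supernode implied if 1.4 million supernodes served 25 million nodes.
implied_ratio = ONLINE_AT_CRASH / REPORTED_SUPERNODES

print(online_range, active_range, round(implied_ratio, 1))
```

The implied ratio comes out near 18 nodes per supernode, which is where the "say, just 20" in the first question comes from.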
I’d like to test his rig, but his response to a request for an interview:
“Hi, Thanks for the offer but I am not speaking to the press for the foreseeable future. Regards, Julian”
Almost 25 million accounts were online when Skype’s 1.4 million supernodes started leaving the cloud. It took 330 minutes for 98% of the supernodes to go offline, cutting off nearly all Skype users. Researcher Julian Cain set up a UDP probe to look at the Skype network as it crashed last week. At the bottom of the blackout, Cain demonstrated how his reverse engineered Skype client attempted to connect to the network and was rejected, like all the other Skype clients struggling to reconnect.
The chart shows 98% of the Skype supernodes leaving the Skype network over 5.5 hours (data). Cain puts minimum viability for the p2p mesh at about 75K nodes.
Cain shot a short video after the network crash. He is showing his “fully functioning 3rd party Skype peer-to-peer stack during the global overlay outage.”
10:42 PM Eastern, still unable to connect with Skype. He shows traceroutes to Luxembourg supernodes operated by Skype, mostly pooled within the same IP ranges. Here’s a list of hard-coded Skype IP addresses from earlier this year. He shows his own reverse engineered, third-party implementation of a Skype client. “It’s the first ever.” He uses Skype’s login server as a bootstrap supernode. “Here I perform a UDP probe of all of the supernodes that ship with the Skype binary to check responsiveness.” He makes his own SkyLib (a core Skype messaging component) and pinged the Skype mega-supernode. He receives a NACK (negative acknowledgement) from the Skype supernode and connects to the network. The network drops his TCP/IP connection, just like any other Skype client in the outage.
Skype promised to report on its internal post-mortem this week. Let’s hope we get this level of disclosure.
The Guys From Queens Podcast Network team switched to Microsoft Live Messenger 2011 and Google Talk to get through the outage. “Nobody has delivered for our needs, anyway, the quality and stability that Skype is normally famous for.” Tom: “My business relies on it.”
They talk through alternatives and why they use Skype for their podcasts. They make fun of the Cisco umi. From the start through minute 14:00.
Quora is hosting questions about Skype’s pre-Christmas 2010 outage. See how people are answering, clarify the questions, answer a few or, please, ask better questions.
“Is it reasonable to complain about a free service being down? Why or why not? A lot of people complaining about it being down, can you really complain about something that you don’t pay for? In my opinion of course you can be annoyed but complaining to the company about it? I’m afraid this is the risk you take when you decide to use a service that you don’t financially support.”
Complaining about the outage of a free service is not just a right, it’s a duty.
Complaining is feedback to the host of the free service that an outage matters. Seriously, there are sites that could go down for a week and nobody would notice.
Complaining signals fellow users of a problem’s existence. We don’t all use a service at the same time. Skype, for example, rarely has more than one sixth of its active users logged in. So those who experience a service interruption spread the warning to the rest of the users.
Complaining builds community. You are not alone in the dark during this outage; you are sharing that experience with others.
Complaining characterizes the interruption’s scope. It is useful to know when the problem is local (and if it affects you); in Skype’s outages, among the first alerts are "I’m having trouble in Singapore" and "Skype 5 is not loading properly in Florida."
Complaining adds the emotional overlay to contextualize the problem. Did you miss a service because it was inconvenient or a little slow? Did the downtime threaten your safety, livelihood, or freedom? Has it cost you money? Do you feel betrayed? All the stakeholders need to understand why customers care, individually and collectively.
Complaining triggers survival instincts among users. People may swarm to help a service (like throttling down consumption if a service is overwhelmed or upgrading to a new client), seek out temporary solutions (working offline, downshifting from group video calls to group audio calls), or design exit strategies (sharing how to extract your information assets, how to migrate to alternative services, how to notify your contacts that you’re moving).
Complaining about an outage can provide useful data for recovering from the outage. Sometimes.
Most people never complain. So complainants speak for the user majority.
Complaining defines this moment in history. Our kvetching about service interruptions informs a service’s reputation and is part of its legend.
Complaining is a productive contribution to business, social and technological ecosystems. Services like GetSatisfaction try to channel complaining, as do other kinds of feedback analytic tools.
Beyond duty to your community, kvetching can feel exquisite. In many cultures, you can heal by getting your strong feelings of frustration and disappointment off your chest. Is complaint a therapy for long term service separation anxiety? Let’s leave that to the Fail Whale.
Photo credit: “rage” cc-by-sharealike by how will i ever, December 10, 2008.
All systems are back online, according to today’s Skype blog update. Here’s a link to the full size of the chart below.
Skypers were active the week before the crash, near Skype’s all-time dial tone high watermark, peaking near 25 million accounts logged in. You can see from the insert in the chart that activity stayed high right through the previous weekend.
Skype’s CIO texted CEO Tony Bates at 4pm Wednesday London time that the network was crashing. Assess, assess, assess. Troubleshoot, troubleshoot, troubleshoot. P2P team steals servers from other departments. Thousands of MegaSuperNodes to the rescue. Core network comes back steadily. Logins throttled and other Skype services turned off. Network continues to come back online. Other departments get their servers back (or near enough) and group video, presence, Skype Manager, and offline instant messaging come back online. Most of the Skype team rushes home for Christmas or Shabbos. Users start their Christmas calling. Fah who foraze! Dah who doraze!
Serious critiques inside and outside of Skype start next week after the eggnog wears off.
Number of people logged in with Skype dial tone: 90% of normal. #skypemebaby
Offline IM: still unavailable. #Groan
Group video calling: still unavailable. #ooVoo
Skype rules out malicious causes for the outage. #wheresJulianAssange
Skype will send 30 minute anywhere-landline calling cards to Pay As You Go and Pre-Pay customers. #iwannabeabillionaire
Skype will extend active subscriptions for a week. #AndOnTheSeventhDaySkypeRested
Caption Contest.
Twenty words or less, no swearing, and no sex-related captions please!
US winner gets a freetalk everyman headset, made for Skyping. Skype and contractor employees may play but aren’t eligible for prizes. Leave your captions in the comments or tweet to @SkypeJournal.
Do you have 100 employees using Skype at one location?
Worried you won’t have enough Skype supernodes to go around?
Worry no more! Now you can buy the Reef9 Node Master for only $495. Just plug it in outside your firewall and watch it go. It will spin up dozens of Skype supernodes. Near your users and always on, so you always have the best access to the Skype network money can buy for just pennies a day. For an extra $45 per year, Reef9 will update your Skype clients with the latest in Skype P2P technology.
Skypelandic engineers turned ordinary blog and accounting servers into powerful superheroes, harnessing Cloud Power to restore conversation to peaceful Skypelandia. Rolf, one of the Mighty MegaSuperNodes, posed for this portrait.