Lessons I Learned from Disney – Part 3 (Balancing Urgency with Impact)

career success Apr 09, 2015

Mid-afternoon, around 3:00 pm, is a prime time for checking into hotels, and the same holds true for Disney’s many resorts. These days, when you check into a Disney resort, you’re given a Magic Band, but when I was working there in the year 2000, guests were given a Key to the World card. Guests used their card to open the door to their room, purchase food, buy merchandise, and get access to the Disney World parks. As a result, getting a call one afternoon around 3:00 pm that a couple of Disney’s resorts (specifically, Disney’s Yacht and Beach Club Resorts) could not create Key to the World cards was a big deal.

Several of us piled into a company van and headed over to Epcot. We took an elevator down three floors below Future World and went into an equipment room containing the Cisco Catalyst switches servicing nearby resorts. The LED load indicator on the Cisco Catalyst’s supervisor engine showed that the switch was under a significant processor load, and we did some packet captures to try to determine what was going on. We soon concluded that we had a broadcast storm, caused by a Layer 2 topological loop. This broadcast storm prevented terminals in the resorts from communicating on the network.

But isn’t Spanning Tree Protocol (STP) supposed to prevent this type of thing from happening? It sure is, but STP had failed on the supervisor engine of the Cisco Catalyst switch.

At this point we had a decision to make. To fully resolve the issue, someone would have to get back in the van, drive back to our office, get a replacement supervisor engine, and bring it back to Epcot. Meanwhile, someone else would need to back up the existing configuration on the failed supervisor engine, so that the configuration could be applied to the replacement supervisor engine when it arrived. After the replacement supervisor engine did arrive, the switch would need to be powered down, have the supervisor engines swapped out, and have the backup configuration applied to the replacement supervisor engine after powering the switch back on. Please keep in mind that during the time required to accomplish all of this, guests would be milling about the lobbies of the Yacht and Beach Club resorts, waiting to check in, and losing their pixie dust by the second.

The immediate course of action we took was to simply unplug one of the redundant links, breaking the Layer 2 topological loop, reasoning that it was better to have a non-redundant yet functional network than a redundant network that was not functional. Later that evening (to minimize impact on users) someone came back and swapped out the switch’s supervisor engine, thus restoring redundancy.
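For readers who like to see the idea in code, here is a minimal Python sketch of why pulling one redundant link works. It is only an illustration: the switch names and the three-link triangle are hypothetical, not the actual topology under Epcot. The sketch checks whether a set of Layer 2 links contains a loop, and whether every switch can still reach the others once a link is removed.

```python
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str]


def _find(parent: Dict[str, str], node: str) -> str:
    """Return the representative of the group that `node` belongs to."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # shorten the path as we go
        node = parent[node]
    return node


def has_loop(links: Iterable[Link]) -> bool:
    """True if the links form at least one Layer 2 loop (a cycle)."""
    parent: Dict[str, str] = {}
    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = _find(parent, a), _find(parent, b)
        if root_a == root_b:
            # Both ends were already connected, so this link closes a loop.
            return True
        parent[root_a] = root_b
    return False


def is_connected(links: Iterable[Link], switches: Iterable[str]) -> bool:
    """True if every listed switch can reach every other one."""
    switches = list(switches)
    parent = {s: s for s in switches}
    for a, b in links:
        root_a, root_b = _find(parent, a), _find(parent, b)
        parent[root_a] = root_b
    return len({_find(parent, s) for s in switches}) == 1


# Hypothetical topology: three switches cabled in a triangle for redundancy.
switches: List[str] = ["epcot-core", "yacht-club", "beach-club"]
redundant: List[Link] = [
    ("epcot-core", "yacht-club"),
    ("epcot-core", "beach-club"),
    ("yacht-club", "beach-club"),
]

# "Unplug" one redundant link, as we did that afternoon.
unplugged: List[Link] = redundant[:-1]

print(has_loop(redundant), is_connected(redundant, switches))
print(has_loop(unplugged), is_connected(unplugged, switches))
```

With working STP, the switch makes this choice for you by blocking a port. With STP broken, unplugging the cable by hand produces the same loop-free yet fully connected network, at the cost of redundancy.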

The Takeaway

What lesson is to be learned from this? It’s weighing urgency against impact. In other words, will the action you’re about to take have a significant impact on users? If it will, is that action justifiable based on the issue you’re trying to resolve?

As an extreme example, imagine that you’re troubleshooting a networked printer serving a couple of users in an office building. You conclude that the switch port to which the printer is connected has failed, and the switch has no spare ports available. You decide that to fix the printer connectivity issue, you need to swap out a 48-port switch. The question is, do you swap out the switch during working hours or wait until later? If you swap it out right away, then the printer issue will be resolved, but you’ll be inflicting temporary pain on dozens of other users when you make the swap. In a case like that, I would suggest that the urgency of the situation (one printer not being available to two users) does not justify the impact (temporarily disconnecting multiple users from the network).
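If it helps to see that trade-off written down, here is a deliberately crude Python sketch. Everything in it is an assumption made for illustration: the `Action` fields, the user counts, and the rule of comparing the number of users hurting now against the number the fix would disrupt. Real decisions also weigh things like outage duration and business criticality.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    users_affected_now: int       # urgency: who is hurting until we act
    users_disrupted_by_fix: int   # impact: who we hurt by acting right now


def act_now(action: Action) -> bool:
    """Hypothetical rule of thumb: act immediately only when urgency outweighs impact."""
    return action.users_affected_now > action.users_disrupted_by_fix


# Illustrative numbers only; none of these counts come from a real incident report.
actions = [
    Action("swap 48-port switch for one printer", 2, 40),
    Action("unplug redundant link during broadcast storm", 200, 0),
    Action("swap supervisor engine after loop is broken", 0, 200),
]

for action in actions:
    verdict = "do it now" if act_now(action) else "schedule for later"
    print(f"{action.name}: {verdict}")
```

The only point of the sketch is that the question has two sides, and the fix that fully resolves the problem is not automatically the one to do first.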

Network engineers often need to make these types of decisions, and although weighing urgency against impact might seem intuitive, in the heat of battle it is easy to react in a knee-jerk way and do whatever is necessary to get the problem resolved right now. That’s something I was guilty of many times in my early days as a network engineer for a university, and my director tore into me in a less than tactful way when pointing out this lesson to me.

Hopefully, my little “tale from the trenches” will come to mind the next time you’re faced with such a decision, and you’ll take a moment to make a strategic decision, which is not necessarily a decision to resolve a problem immediately and at all costs.