A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. -Leslie Lamport
On my way to QCon Tokyo and QCon China, I had some time to kill so I
headed over to Delta's Skyclub lounge. I've been a member for a few
years now. And why not? I mean, who could pass up tepid coffee, stale
party snacks, and a TV permanently locked to CNN? Wait... that actually
doesn't sound like such a hot deal.
Oh! I remember, it's for the wifi access. (Well, that plus reliably
clean bathrooms, but we need not discuss that.) Being able to count on
wifi access without paying for yet another data plan has been pretty
helpful for me. (As an aside, I might change my tune once I try a MiFi box. Carrying my own hotspot sounds even better.)
Like most wifi providers, the Skyclub has a captive portal. Before you
can get a TCP/IP connection to anything, you have to submit a form
with a checkbox to agree to 89 pages of terms and conditions. I'm well
aware that Delta's lawyers are trying to make sure the company isn't
liable if I go downloading bootlegs of every Ally McBeal episode. But
I really don't know if these agreements are enforceable. For all I
know, page 83 has me agreeing to 7 years of indentured servitude cleaning
Delta's toilets.
Anyway, Delta has outsourced operations of their wifi network to
Concourse Communications. And apparently, they've had an outage all
morning that has blocked anyone from using wifi in the Minneapolis
Skyclubs. When I submit the form with the checkbox, I get the
following error page:
Including this bit of stack trace:
There's a lot to dislike here.
- Why is this yelling at me, the user? To anyone who isn't a web
site developer, this makes it sound like the user did something
wrong. There's a ton of scary language here: "instance-specific
error", "allow remote connections", "Named Pipes Provider"... heck,
this sounds like it's accusing the user of hacking servers. "Stack
trace" sure sounds like the Feds are hot on somebody's trail, doesn't
it?
- Isn't it fabulous to know that Ken keeps his projects on his D:
drive? If I had to lay bets, I'd say that Ken screwed up his
connection string. In fact, the whole problem smells like a failed
deployment or poorly executed change. Ken probably pushed some code
out late on a Friday afternoon, then boogied out of town. My
prediction (totally unverifiable, of course) is that this problem will
take less than 5 minutes to resolve, once Ken gets his ass back from
the beach.
- We mere users get to see quite a bit of internal
information here. Nothing really damaging, unless of course Wilson
ORMapper has some security defects or something like that.
- Stepping back from this specific error message, we have the larger
question: is it sensible to couple availability of the network to the
availability of this check-the-box application? Accessing the network
is the primary purpose of this whole system. It is the most critical
feature. Is collecting a compulsory boolean "true" from every user
really as important as the reason the whole damn thing was built in
the first place? Of course not! (As an aside, this is an example of Le Chatelier's Principle: "Complex systems tend to oppose their own proper function.")
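To make that decoupling concrete, here is a minimal sketch in Python of a portal handler that "fails open": it tries to record the checkbox, but grants network access whether or not the recording works. Everything named here (`handle_portal_submit`, `record_acceptance`, `grant_network_access`, the retry queue) is hypothetical and invented for illustration. I have no idea how the real portal is built.

```python
import queue

# Hypothetical retry queue for acceptance records that could not be stored.
pending_acceptances = queue.Queue()


def handle_portal_submit(client_id, record_acceptance, grant_network_access):
    """Grant access even when the acceptance store is down (fail open)."""
    try:
        record_acceptance(client_id)        # non-critical: may hit a dead database
        recorded = True
    except Exception:
        pending_acceptances.put(client_id)  # keep it for after-the-fact processing
        recorded = False
    grant_network_access(client_id)         # critical: always attempted
    return recorded


def broken_store(client_id):
    # Simulated outage of the acceptance database (an assumption for the demo).
    raise ConnectionError("database unreachable")


granted = []
was_recorded = handle_portal_submit("laptop-42", broken_store, granted.append)
```

In this sketch the user gets online even though the store is down, and the unrecorded acceptance waits in the queue to be replayed later. Whether the lawyers would sign off on failing open is a separate conversation.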
We see this kind of operational coupling all the time. Non-critical
features are allowed to damage or destroy critical features. Maybe
there's a single thread pool that services all kinds of requests,
rather than reserving a separate pool for the important things. Maybe
a process is overly linearized and doesn't allow for secondary,
after-the-fact processing. Or, maybe a critical and a non-critical
system both share an enterprise service---producing a common-mode
dependency.
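The thread pool case can be sketched in a few lines of Python. This is only an illustration under assumed names and pool sizes (`critical_pool`, `noncritical_pool`, `stuck_terms_check`, and `grant_access` are all made up): the non-critical pool is saturated with hung work, yet the critical request still completes because it has threads of its own.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Two separate pools; the sizes are arbitrary assumptions for the demo.
critical_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="critical")
noncritical_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="noncritical")

release = threading.Event()


def stuck_terms_check():
    # Simulates a call hanging on a dead dependency until released.
    release.wait(timeout=5)
    return "terms recorded"


def grant_access(client_id):
    return f"access granted to {client_id}"


# Saturate the non-critical pool: two tasks hang, two more wait behind them.
stuck = [noncritical_pool.submit(stuck_terms_check) for _ in range(4)]

# The critical request is unaffected because it never shares those threads.
result = critical_pool.submit(grant_access, "laptop-42").result(timeout=1)

release.set()  # let the hung tasks finish so the demo exits cleanly
critical_pool.shutdown()
noncritical_pool.shutdown()
```

Had both kinds of work shared one two-thread pool, the `grant_access` call would have sat in the queue behind the hung tasks until they timed out.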
Whatever the proximate cause, the underlying problem is a
lack of diligence in operational decoupling.