The Order Of Things

Oil Refinery
Unless you’ve read the manual, messing with the startup sequence of an oil refinery is not recommended.
(photo: Walter Siegmund, license: CC BY 2.5)

The expression “there’s more than one way to skin a cat” nicely summarizes the “change the order” troubleshooting strategy. The concept is to alter the sequence of steps being taken when starting up, configuring, or operating a machine. Among these multiple pathways there will be some combinations which work, and some that won’t. Those pathways may be identical, except for the order in which things happen. By simply rearranging the progression of events, you can sometimes get a system working again.

When using this strategy, don’t feel that you have to understand why one pathway works and another fails. Fix first and ask questions later. Why one particular sequence works and another doesn’t may be beyond even the original designer’s understanding. Given that so many products today are integrations of parts cobbled together from multiple manufacturers and outsourced teams spread around the globe, this isn’t surprising: not all use cases or interactions among components are well understood. As a troubleshooter, you should be focused on the outcome of getting things running again. Answering the question “Why?” has an economic component to it: sometimes there is a hefty cost in terms of time and effort to fully understand a problem. When knowledge is pricey, you may be better off in blissful ignorance!

On Startup

If you have a machine with multiple subsystems, interactions between them while they’re starting up may cause a failure. Consider a simple web site that consists of a database, web server and application server. These 3 elements can be started up in the following different ways:

Sequence   Order of Startup
           1st                  2nd                  3rd
A          Application Server   Database Server      Web Server
B          Application Server   Web Server           Database Server
C          Database Server      Application Server   Web Server
D          Database Server      Web Server           Application Server
E          Web Server           Application Server   Database Server
F          Web Server           Database Server      Application Server

That’s just 3 components and look how many different ways there are to initialize the system (6 to be exact, lettered A-F in the table above). The number of possibilities grows even faster than exponentially: it is the factorial of the number of subsystems (n), expressed as:

n!
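
To make the counting concrete, here is a minimal Python sketch that enumerates the orderings for the three subsystems above. The variable names are my own, and it only assumes the standard library's itertools and math modules.

```python
from itertools import permutations
from math import factorial

subsystems = ["Application Server", "Database Server", "Web Server"]

# permutations() walks the input in position order, so listing the
# subsystems alphabetically reproduces the A-F lettering of the table.
sequences = list(permutations(subsystems))

for letter, order in zip("ABCDEF", sequences):
    print(letter, " -> ".join(order))

# The number of orderings equals n! for n subsystems.
assert len(sequences) == factorial(len(subsystems))
```

Adding a fourth name to the list grows the output to 24 lines, a fifth to 120, and so on.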

Here’s a table that shows you just how quickly the potential complexity of a startup sequence can grow as the number of subsystems increases:

# of Subsystems   # of Unique Startup Sequences
1                 1
2                 2
3                 6
4                 24
5                 120
6                 720
7                 5,040
8                 40,320
9                 362,880
10                3,628,800

Astonishing, right?!

Back to our example: you’ve just upgraded the application server and, unbeknownst to you, the new version checks at startup to see if the web server is already running. If it isn’t, the application server will mysteriously quit without explanation. This means certain startup sequences will result in a failure (specifically: A, B, and C), which are marked NO in the table below:

Startup    Order of Startup                                                 Starts up
Sequence   1st                  2nd                  3rd                    successfully?
A          Application Server   Database Server      Web Server             NO
B          Application Server   Web Server           Database Server        NO
C          Database Server      Application Server   Web Server             NO
D          Database Server      Web Server           Application Server     YES
E          Web Server           Application Server   Database Server        YES
F          Web Server           Database Server      Application Server     YES

However, sequences lettered D, E, and F will be fine. If you had encountered this problem “in the wild” and simply cycled through the various startup sequence possibilities, you could have fixed the problem.
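
The cycling-through approach can be sketched in a few lines of Python. The starts_successfully function below is a hypothetical stand-in for actually booting the system: it simply models the rule from the example (the application server quits unless the web server is already up).

```python
from itertools import permutations

APP, DB, WEB = "Application Server", "Database Server", "Web Server"

def starts_successfully(order):
    """Model of the upgraded application server: it needs the web server up first."""
    return order.index(WEB) < order.index(APP)

# Letter the six orderings A-F, then record which ones start cleanly.
results = {
    letter: starts_successfully(order)
    for letter, order in zip("ABCDEF", permutations([APP, DB, WEB]))
}
working = [letter for letter, ok in results.items() if ok]
print(working)  # ['D', 'E', 'F']
```

In a real system the check would be "start everything in this order and see whether it comes up," which is far slower than a function call, and that cost is what motivates the shortcuts in the next section.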

Tweaking The “Change The Order” Strategy

As you can see from the table showing the growth in the number of startup possibilities, even with just 5 subsystems, there are 120 unique startup sequences. That’s a lot of things to try! A shortcut is to first make an educated guess as to the failing subsystem. In any troubleshooting exercise, there will frequently be supplementary evidence pointing to the culprit. In the example above, it could have been that the application server was missing from the list of running processes. After you’ve made your guess, try these 2 things:

  1. Move the suspicious subsystem to the first position in the startup sequence.
  2. Move the suspicious subsystem to the last position in the startup sequence.

Putting the suspicious subsystem first or last gets quickly to the underlying reason as to why the “change the order” strategy works: something needs to be present or absent for the subsystem to function. Putting it first ensures that, with all the other subsystems off, nothing can interfere with its ability to function. However, if the failing subsystem requires another subsystem to function, starting it up at the end gives it the greatest chance of meeting this condition because all possible dependencies will be satisfied.
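
As a rough illustration, the first-or-last shortcut boils down to a small helper. The function name candidate_orders is my own invention, and the sketch assumes the other subsystems keep their existing relative order.

```python
def candidate_orders(sequence, suspect):
    """Return the two orderings worth trying first: suspect first, then suspect last."""
    others = [s for s in sequence if s != suspect]
    return [[suspect] + others, others + [suspect]]

current = ["Database Server", "Application Server", "Web Server"]
for order in candidate_orders(current, "Application Server"):
    print(" -> ".join(order))
```

For the web site example, the two candidates are sequences A and D from the earlier table. A fails and D works, so two attempts are enough instead of six.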

For the ultimate example of a complicated initialization sequence, consider a modern oil refinery. It can take weeks to bring a refinery on-line from a completely cold start. Of course, if you tinker with that kind of startup sequence as a troubleshooting strategy, you’d better know what you’re doing!

When Configuring

The same principle applies to configuring a machine; changing the order in which you apply your configuration options may make the difference between it working and it not working. Take the example of a network router, which you want to set up as follows:

  1. Add an entry to the routing table.
  2. Set the default gateway.
  3. Block connections from all outside networks.
  4. Allow connections from select internal networks.

You notice that, after applying these options in the order listed above, the router will consistently crash and reboot itself after applying rule #2 (i.e., you never make it to #3). Let’s say that, after playing around with the above order, you can get the router to work by placing rule #2 at the end of your configuration recipe. According to what you know about what the router should be capable of, there’s no good reason why you shouldn’t be able to do the configuration in your original order, but keep in mind that troubleshooting is dealing with reality. The reality of the situation is that no amount of hemming and hawing over how the device should work will make your original configuration stick. You tried it. It didn’t work. Move on.

Also note that the math concerning the number of possibilities is the same: if you want to count the number of unique ways to apply a set of configuration options, it’s also n!. Again, if the number is large you’ll want to make an educated guess as to the option most likely to be causing the problem and move that to the front or to the end.
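
Here is one way the search could be organized in Python: try the two shortcut orderings first, then fall back to the remaining n! possibilities. Everything here is an assumption for illustration. In particular, router_accepts is a made-up stand-in for "apply the options to the device and see if it survives," modeled on the router in the example that only works when the default gateway is set last.

```python
from itertools import chain, permutations

def find_working_order(steps, suspect, works):
    """Try suspect-first and suspect-last, then fall back to every ordering."""
    others = [s for s in steps if s != suspect]
    shortcuts = [tuple([suspect] + others), tuple(others + [suspect])]
    tried = set()
    for order in chain(shortcuts, permutations(steps)):
        if order in tried:
            continue  # skip orderings already attempted as shortcuts
        tried.add(order)
        if works(order):
            return order
    return None  # no ordering worked

steps = ("add route", "set default gateway",
         "block outside networks", "allow internal networks")

def router_accepts(order):
    # Hypothetical model: the router only survives if the gateway is set last.
    return order[-1] == "set default gateway"

print(find_working_order(steps, "set default gateway", router_accepts))
```

With a good guess about the suspect, the answer turns up on the second attempt rather than somewhere among all 24 orderings.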

When Operating

This general strategy is also effective, for the same reasons, when changing the order in which a system does its work. You may have a collection of tasks (A, B, C) that need to be done, but the final result is independent of the sequence in which these tasks are completed. That is, A → B → C results in the same output as all the other possibilities: A → C → B, B → A → C, B → C → A, C → B → A, and C → A → B. While the order of a workflow may not matter theoretically, as we’ve seen, sometimes in practice it makes a big difference.

Imagine a computer program that validates data coming from a form on a web page, checking that a customer’s name, address, and telephone number are all formatted correctly for a database. Offhand, you’d say that it shouldn’t matter in what order these validations are made: all three fields must eventually be checked for the data to be made safe for the database. However, let’s say that our telephone number validator uses a code library with a bug that causes corruption for the text validator. That means that the program will only work if the telephone number is validated last. Of course, this isn’t how code libraries are supposed to work: they should be bug-free and well-tested! However, don’t let a fixation on “should” get in your way and blind you to considering options that achieve your end goal. Can you rearrange the steps of your workflow, achieve the same result, and do an end-run around your issue all at once? Many times, the answer is “Yes!”
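
A toy Python version of this scenario might look like the following. The validators, the shared dictionary, and the simulated bug are all hypothetical; they exist only to show how a hidden side effect can make an "order-independent" workflow depend on order.

```python
# Hypothetical shared state standing in for the buggy library in the text.
shared = {"corrupted": False}

def validate_text(value):
    if shared["corrupted"]:
        raise RuntimeError("text validator state corrupted")
    return bool(value.strip())

def validate_phone(value):
    digits = [c for c in value if c.isdigit()]
    shared["corrupted"] = True  # simulated bug: side effect that breaks validate_text
    return len(digits) >= 7

def run(order, form):
    """Run the validators in the given order; any crash counts as a failure."""
    shared["corrupted"] = False  # fresh state for each run
    try:
        return all(check(form[field]) for field, check in order)
    except RuntimeError:
        return False

form = {"name": "Ada Lovelace", "address": "12 Example St", "phone": "555-0100"}
phone_first = [("phone", validate_phone), ("name", validate_text), ("address", validate_text)]
phone_last = [("name", validate_text), ("address", validate_text), ("phone", validate_phone)]

print(run(phone_first, form))  # False
print(run(phone_last, form))   # True
```

The same three checks run in both cases; only the order differs, and only the phone-last order gets through.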

*** Questions? Comments? Have a related troubleshooting story that you’d like to share? Feel free to leave your feedback in the comments section below! ***

One Comment

  1. I’m dated in my memory, but I thought a Japanese man in Quality Improvement devised the “Taguchi Method” to sequence experiments more efficiently than just doing all the combinations. The idea is to reduce the number of experiments needed to reach findings. Thx. for site.
