The Hum of The Machine
Rachel Davies has been writing about some insights she gained from a tour of the Toyota plant in Derby, which reminded me I should write something about a couple of concepts Steve Jones (of Egg) and I worked on as analogies for large-scale systems monitoring.
We started by discussing what it was that made mechanical production systems easier to monitor than IT systems and were talking about visibility. This rapidly turned into a discussion about factories, and then cars and their engines. Put two blokes together and if they don't end up talking about football they'll end up talking about cars.
Only this time it was interesting because we realised that visibility is not the key. It's audibility. The Hum of The Machine. When you're driving you can't see the engine, so you rely a lot on the sound of the car - it's the clunking sound that forces you to take a look or take it to the garage. It's the same effect that allows an experienced supervisor on a printing press, or other automated mechanical machine, to sit and read a book or watch TV, safe in the knowledge that all is well. If the tone or rhythm changes then the book gets put down and further investigation takes place.
But with IT systems the things that are most important to us are silent. I think the idea of having web servers shout out when they've successfully completed a request (or thousand) is very compelling and engaging at a level that systems monitoring tools like Patrol and OpenView just aren't. The key difference between Rachel's thoughts and the ideas Steve and I discussed is that Steve and I suggest having sound as the normal, healthy state as with a mechanical system.
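To make the idea a little more concrete, here is a minimal sketch of one way a server's health could be turned into a hum. It is only an illustration, not something Steve and I built: the base pitch, the mapping from load and error ratio to frequency, and the names hum_frequency and write_hum are all assumptions. Healthy traffic produces a steady tone that rises gently with load, and errors pull the pitch down so that the change is audible.

```python
import math
import struct
import wave

BASE_HZ = 110.0      # assumed pitch of an idle but healthy server
SAMPLE_RATE = 8000   # samples per second in the rendered audio


def hum_frequency(requests_per_sec, error_ratio):
    """Map load and error ratio to a pitch (hypothetical mapping).

    Pitch rises with the logarithm of the request rate, and errors
    detune it downward by up to half.
    """
    load_hz = BASE_HZ * (1.0 + math.log10(1.0 + requests_per_sec))
    clamped = min(max(error_ratio, 0.0), 1.0)
    return load_hz * (1.0 - 0.5 * clamped)


def write_hum(path, readings, seconds_per_reading=0.25):
    """Render (requests_per_sec, error_ratio) readings as a mono 16-bit WAV."""
    phase = 0.0
    frames = bytearray()
    for rps, err in readings:
        step = 2.0 * math.pi * hum_frequency(rps, err) / SAMPLE_RATE
        for _ in range(int(SAMPLE_RATE * seconds_per_reading)):
            frames += struct.pack("<h", int(12000 * math.sin(phase)))
            phase += step  # carry the phase over so the tone has no clicks
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(SAMPLE_RATE)
        out.writeframes(bytes(frames))
```

Feeding it a list such as [(200, 0.0), (200, 0.0), (200, 0.4)] gives two identical notes followed by a noticeably lower one, which is the "change in tone" that would make the supervisor put the book down.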
The other concept we decided was worth pursuing was the idea of variable monitoring and tracking of individual transactions. We took the concept, and its name, from The Barium Swallow - a medical investigative procedure that involves swallowing a radio-opaque barium liquid and watching its passage through the oesophagus on an X-ray screen.
The equivalent we discussed here was the ability to send genuine requests into a live system, but tagged in some way, to trigger a higher level of logging and/or alerting. This kind of facility would make diagnosing problems in live systems easier and more viable than simply turning up the logging, as it would provide detailed logging for a single transaction rather than the vast volumes of logging that you get otherwise.
If anyone knows of any commercial products doing this then I'd love to hear from you.