In the book "Release It!", the author describes an incident in which an airline's entire check-in system went down for three hours, grounding hundreds of planes and causing a sizable backlog for hours afterward. The 'root cause' was code on the flight search server:
close() can throw, and during the outage it did for stmt, so the connection was never closed. Eventually the pool was exhausted, with every thread blocked waiting for a connection. It's an interesting chain of failures; arguably the presence of such a chain is the real root cause, rather than the unhandled SQL exception.
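For anyone who hasn't seen the failure mode, here is a minimal sketch of it. This is not the book's code: FakeStatement and FakeConnection are hypothetical stand-ins for the JDBC types so it runs without a database, and the method names are made up. It shows a finally block where a throwing stmt.close() skips conn.close(), next to a try-with-resources version that still releases the connection.

```java
import java.sql.SQLException;

// Hypothetical stand-ins for java.sql.Statement and java.sql.Connection,
// so the sketch runs without a database. Not the book's code.
class FakeStatement implements AutoCloseable {
    @Override
    public void close() throws SQLException {
        // Simulates the outage: closing the statement itself fails.
        throw new SQLException("statement close failed");
    }
}

class FakeConnection implements AutoCloseable {
    boolean closed = false;

    FakeStatement createStatement() {
        return new FakeStatement();
    }

    @Override
    public void close() {
        closed = true; // with a real pool, this is where the connection goes back
    }
}

class CloseOrderDemo {
    // Leaky pattern: if stmt.close() throws, conn.close() is skipped.
    static void leaky(FakeConnection conn) throws SQLException {
        FakeStatement stmt = null;
        try {
            stmt = conn.createStatement();
            // ... run the query ...
        } finally {
            if (stmt != null) {
                stmt.close(); // throws here
            }
            conn.close();     // never reached
        }
    }

    // try-with-resources closes resources in reverse declaration order and
    // still closes the connection when the statement's close() throws.
    static void safe(FakeConnection conn) throws SQLException {
        try (FakeConnection c = conn; FakeStatement stmt = c.createStatement()) {
            // ... run the query ...
        }
    }
}
```

Both methods still propagate the SQLException to the caller; the difference is only whether the connection is released on the way out.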
> the presence of such a chain is the real root cause, rather than the unhandled sql exception.
This is really interesting, and it touches on something that bugs me about root cause analysis. It's also a neat coincidence that this was quoted in connection with an aviation incident.
In aviation, incidents and accidents are investigated with the understanding that there is never a single cause of an accident. This is known as the Swiss cheese model: all the holes in the Swiss cheese have to line up for something to go wrong. Even in a seemingly simple "pilot error" accident, there are years of initial and recurrent training factors, ergonomic and human factors, and so on, which all lead to the event. It's exceedingly rare for a single "root cause" to be the whole story.
Medicine is starting to adopt techniques learned from aviation like checklists, crew resource management and no-blame, swiss-cheese accident investigations. I am hopeful that the software industry will take similar lessons over the next decade or so.
> like checklists, crew resource management and no-blame, swiss-cheese accident investigations. I am hopeful that the software industry will take similar lessons over the next decade or so.
The software industry that programs spacecraft?
The software industry is vast, and not every system involves copious amounts of human decision making. Often the root system cause and the root process cause (software construction, operations, etc.) are separable.
I would say that aviation is almost inverted in that regard compared to booking systems, banking systems, and most of what software engineers are exposed to. A person cannot fly from Dallas to Chicago without many, many human decisions being involved. However, a packet traveling from Dallas to Chicago involves nearly zero new human interactions.
I know people like to hate on errors as values as in Go, but I think exceptions are worse when it comes to unexpected side effects and this is a prime example!
Yes; although in GC’ed languages like Go and JS it’s still very easy to leak OS-level resources like file handles, because you still need to remember to close() them. (Although Go’s defer statements are a fantastic assist here.)
This is one area where Rust really excels: the same mechanism that makes sure memory gets cleaned up also automatically closes network sockets and file descriptors when they go out of scope. Even in the case of errors it’s impossible to forget to clean up. That entire finally block is unnecessary in Rust.
I'm learning Rust coming from Go. It looks cool, but it also concerns me how most data structures in the stdlib use unsafe blocks to defeat the borrow checker. This is not the point of Rust, I would have thought?!
One of the criteria for belonging in the standard library is “needs a lot of unsafe to implement”, so the standard library has more unsafe code than your average codebase.
Beyond that, to some degree, it is the point of Rust: limit unsafe things so that you can reason about them more easily. The CPU is inherently not safe, so unsafe code has to exist on some level. Rust gives you tools to manage this.
"Release It" is an awesome book. I contributed a couple of FindBugs Java checkers to warn about some of the problems the book describes.
The case you mention above might have been prevented by using checked Java exceptions. Our programming languages and tools could be doing a lot more to catch these problems at compile time or make them impossible by language or API design.
My favorite is constructors of things that take input/output streams and can themselves throw.
It becomes very verbose to ensure that the unwinding happens correctly.
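A rough sketch of that wrapper problem, under some assumptions: TrackedStream and HeaderCheckingStream are hypothetical classes made up for illustration (real wrappers such as GZIPInputStream and ObjectInputStream also read a header in their constructors). If the wrapper's constructor throws inside a nested `new Wrapper(new Inner(...))` expression, nothing is left holding the inner stream. Declaring each stream as its own resource is one way to keep the unwinding correct without hand-written finally blocks.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: an in-memory stream that records whether close() ran.
class TrackedStream extends ByteArrayInputStream {
    boolean closed = false;

    TrackedStream(byte[] data) {
        super(data);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}

// Hypothetical wrapper whose constructor reads a one-byte header and may throw.
class HeaderCheckingStream extends FilterInputStream {
    HeaderCheckingStream(InputStream in) throws IOException {
        super(in);
        if (in.read() != 0x2A) {
            throw new IOException("bad header");
        }
    }
}

class WrapperDemo {
    // Nested construction: if the wrapper's constructor throws, the try
    // statement has no resource to close, so `raw` stays open.
    static int nested(TrackedStream raw) throws IOException {
        try (InputStream in = new HeaderCheckingStream(raw)) {
            return in.read();
        }
    }

    // One resource per declaration: `r` is registered before the wrapper is
    // constructed, so it is closed even when that constructor throws.
    static int separate(TrackedStream raw) throws IOException {
        try (TrackedStream r = raw; InputStream in = new HeaderCheckingStream(r)) {
            return in.read();
        }
    }
}
```

In the success path of `separate` the inner stream gets closed twice (once through the wrapper, once directly), which is harmless here but does rely on close() being idempotent.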