Once again inspired by something I found on the Daily WTF.
A few years ago, I worked with a team of developers on their in-house order entry system. The system had been written over the course of several years with no real architectural plan. Just build it as they ask for it. Of course, the system was fraught with issues. But it did the job well enough, and the company was convinced they were so unique in the way they took and filled orders that this piece of custom software was the ONLY solution they would ever be able to use.
When I got involved, the team was in a maintenance phase between projects. They were slowly ticking away at the literally thousands of bug reports from customers, addressing them one at a time in no particular order.
We decided to take a slightly different approach and look at the bug lists from a 10,000-foot view. We lumped bugs together into various categories and discovered that record locking was the root cause of MANY of the issues. So we set about creating a standard for record locking and transaction management. We then tackled this one problem everywhere it existed, starting with the most critical code.
After some time, we had made a significant dent in the code and our tests were looking good. In fact, CPU utilization in the test and QA environments appeared lower. But the real test would be Production.
We rolled the first wave of changes out to Production, and as was tradition, the developers crossed their fingers, waited for the phone calls, and readied themselves to do battle, in real time, with code in Production.
Hours passed without a single complaint. Finally, near the end of the day, helpdesk received a call from order entry.
"Something is wrong with the system," the caller reported. "I think we're losing orders!"
"What makes you think orders are lost?" asked the helpdesk staffer.
"Well," explained the caller, "we usually get a message every hour or so telling us the order table can't be updated because somebody else is using it. This happens anytime several of us are taking calls. We just wait a minute and then hit 'OK' again and it works."
"Yes...?"
"Well, I checked with the other folks here and nobody got the message today. Not one of us. But we were all taking calls and placing orders. The system must be losing some of the orders if it doesn't know we're all putting them in."
Yes, folks, the system was "broken" because it wasn't generating enough errors.