This is a follow-up to part one.
I have made some more progress on this project, and we are now down to 60 errors per day from the previous thousand. Given that we are integrating with over a hundred systems, this is more acceptable. Once a breakthrough cleared the highest-frequency problem, the rest were easy to fix.
A lot of the existing log messages lacked context. They typically included the method being called and the stack trace. What was missing was enough information to recreate the problem.
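To illustrate what a log message with context might look like, here is a minimal Python sketch. It is not the project's actual code: the names `describe_failure`, `sync_order`, `system`, `request_id` and the `send` callable are all hypothetical, chosen only to show the inputs being logged alongside the stack trace.

```python
import logging

logger = logging.getLogger("integration")


def describe_failure(operation, system, request_id, payload):
    """Build a message that carries the inputs needed to replay the call.

    All field names here are illustrative assumptions.
    """
    return (
        f"{operation} failed for system={system} "
        f"request_id={request_id} payload={payload!r}"
    )


def sync_order(system, request_id, payload, send):
    """Call a partner system; on failure, log the inputs, then re-raise."""
    try:
        return send(payload)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback,
        # so the message only needs to add what the traceback lacks.
        logger.exception(
            describe_failure("sync_order", system, request_id, payload)
        )
        raise
```

With the system, request id and payload in the message, a failing call can be replayed in a test rather than guessed at from the stack trace alone.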
My previous claim about not using a debugger no longer stands. I was forced to use one to identify paths through some code that had not been developed via TDD (it had some acceptance tests, but the details were obscured). We are still removing the old Betamax tests, which consist of recordings of a production run that are replayed. Recordings like these are a good start, but they are painful to adapt.