Read the Logs

One of my morning rituals at work is to look at a custom summary view of error logs. This allows me to learn what is happening in the system.

Combined with an active drive to clean up the most frequent errors, this is a great way to learn what a normal day looks like. Anything new has been caused by a change somewhere or by an external failure.
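As a rough illustration of that kind of summary view, here is a minimal sketch in Python. It is not the view I actually use: the normalisation rules and the sample messages are assumptions made up for the example. The idea is to collapse the variable parts of each message so that similar errors group together, count them, and then flag anything that was not in yesterday's baseline.

```python
import re
from collections import Counter


def normalise(message: str) -> str:
    """Collapse variable parts so that similar errors group together."""
    # Replace long hex-looking tokens (ids, hashes) first, then any remaining digits.
    message = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", message)
    return re.sub(r"\d+", "<n>", message)


def summarise(messages) -> Counter:
    """Count how often each normalised error message occurs."""
    return Counter(normalise(m) for m in messages)


def new_errors(today: Counter, baseline: Counter) -> dict:
    """Return the errors seen today that never appeared in the baseline."""
    return {msg: count for msg, count in today.items() if msg not in baseline}


# Hypothetical sample data, for illustration only.
yesterday = summarise(["Timeout after 30 ms for order 1234"])
today = summarise([
    "Timeout after 45 ms for order 99",
    "Timeout after 12 ms for order 7",
    "Card declined for payment 55",
])

for message, count in today.most_common():
    print(count, message)
print("New today:", new_errors(today, yesterday))
```

In practice the baseline would probably cover more than one day, so that errors which only happen weekly do not look new every time.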

This also gives you a way to estimate how frequently errors are happening compared to the events users are attempting.
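The arithmetic is simple, but it helps to see it written down. The sketch below assumes you can get a count of errors and a count of attempts per event type. The event names and numbers are invented for the example.

```python
def error_rates(errors: dict[str, int], attempts: dict[str, int]) -> dict[str, float]:
    """Return errors divided by attempts for each event type that was attempted."""
    rates = {}
    for event, attempted in attempts.items():
        if attempted > 0:  # skip events nobody attempted, to avoid dividing by zero
            rates[event] = errors.get(event, 0) / attempted
    return rates


# Hypothetical counts: far more quotes than sales.
rates = error_rates(
    errors={"sale": 5},
    attempts={"sale": 1_000, "quote": 1_000_000},
)
for event, rate in rates.items():
    print(f"{event}: {rate:.3%} of attempts failed")
```

Five errors out of a thousand sales is a very different problem from five errors out of a million quotes, which is why the raw error count on its own can mislead.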

Typically I ask in one of our developer Slack channels whether anyone is aware of the new issue. Sometimes people already know about the problem. Frequently, further investigation is required.

For context, I work in a medium-sized company with multiple monoliths surrounded by a suite of services. Errors in the logs within my team's scope can be caused by work belonging to my team or, more frequently, by changes made by other teams.

Note that this is mostly about errors. I recommend logging an error whenever something has gone wrong. This matters most in the parts of a system where customers are paying for something. Our system has two major parts, quotes and sales, and there are about 1000x as many quotes as sales. The quote system should log less per event than the sales system, otherwise its messages will dominate the logs.

Recently I found a new log message stating that manual intervention was required for a process. I had to ask the person who wrote that code exactly what the manual intervention was and who would do it.

I recommend adding links in the log message to a page in the wiki with the instructions to fix. This can start out as a placeholder.
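Here is a minimal sketch of what that could look like with Python's standard logging module. The wiki URL, the logger name and the reference format are all placeholders I made up for the example, not real ones.

```python
import logging

logger = logging.getLogger("refunds")  # hypothetical logger name

# Hypothetical wiki page; it can start out as an empty placeholder.
RUNBOOK_URL = "https://wiki.example.com/runbooks/refund-manual-intervention"


def manual_intervention_message(process: str, reference: str, runbook_url: str) -> str:
    """Build a log message that tells the reader where the fix instructions live."""
    return (
        f"Manual intervention required for {process} "
        f"(ref={reference}). Runbook: {runbook_url}"
    )


logger.error(manual_intervention_message("partial refund", "refund-123", RUNBOOK_URL))
```

Whoever sees this in the logs no longer has to track down the author of the code to find out what the manual intervention is and who should do it.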

I also find that info or warning messages can be a great way to prove that a change has worked. I recently added a scheduled retry in 24 hours for a weird refund scenario (you can’t partially refund a credit card transaction that is less than 24 hours old…). By logging the retries it was possible to see how frequently this problem happened and how many manual interventions were required (in this case, not many retries and no manual interventions).
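The retry logging could be sketched like this. It is an illustration rather than the real code: the function names, the message names and the refund id format are assumptions. What matters is that each step logs a distinct, searchable message so the retries and the manual interventions can be counted later.

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger("refunds")  # hypothetical logger name

RETRY_DELAY = timedelta(hours=24)


def schedule_refund_retry(refund_id: str, now: datetime) -> datetime:
    """Work out when to retry and log it at info level so retries can be counted."""
    retry_at = now + RETRY_DELAY
    logger.info(
        "refund_retry_scheduled refund_id=%s retry_at=%s",
        refund_id,
        retry_at.isoformat(),
    )
    return retry_at


def record_retry_outcome(refund_id: str, succeeded: bool) -> None:
    """Log the outcome; a failed retry is the case that needs a person."""
    if succeeded:
        logger.info("refund_retry_succeeded refund_id=%s", refund_id)
    else:
        logger.warning("refund_retry_needs_manual_intervention refund_id=%s", refund_id)


retry_at = schedule_refund_retry("refund-123", datetime(2024, 1, 1, tzinfo=timezone.utc))
record_retry_outcome("refund-123", succeeded=True)
```

Counting the `refund_retry_scheduled` lines shows how often the scenario happens, and counting the warnings shows how many manual interventions were needed.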

Thoughts on scope of bugfixes

When working on a bugfix do you handle all of the edge cases?

Typically if a system is down you need to do what it takes to get the system back up as quickly as possible without making things worse. This can involve leaving rare edge cases as failure messages with logs that will help fix the problem in future.

This all depends upon the severity of a failure, the speed of deployment and the capacity to fix the problem in the near future.

Sometimes having a manual workaround for a rare, non-time-critical issue is a better use of time than overengineering a solution for a problem that may never happen.

Recently I have had to work on production issues that cannot be recreated in a test environment (without waiting a day or so to set up the test data).

A related type of problem is the bug that could have multiple causes. You think you have recreated the issue and you fix it, only to find it is still broken in production. Having audit logs at info level can help establish exactly what was attempted. Event sourced systems can recreate what succeeded, but not always the things that failed.

When logging a failure message always give enough information to locate the error without leaking PII.
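One way to do this is to only log fields from an allow-list of internal identifiers, so that anything else (names, emails, card numbers) is dropped by default. The sketch below assumes the failure context arrives as a dictionary. The field names and the allow-list are hypothetical.

```python
import logging

logger = logging.getLogger("payments")  # hypothetical logger name

# Hypothetical allow-list: internal identifiers that locate the error without naming a person.
SAFE_KEYS = {"order_id", "payment_id", "error_code"}


def safe_context(context: dict) -> dict:
    """Keep only allow-listed fields; everything else is dropped by default."""
    return {key: value for key, value in context.items() if key in SAFE_KEYS}


def log_failure(message: str, context: dict) -> None:
    """Log a failure with enough identifiers to find it, and nothing more."""
    logger.error("%s context=%s", message, safe_context(context))


log_failure(
    "Refund failed",
    {"order_id": "o-1", "error_code": "E42", "email": "someone@example.com"},
)
```

An allow-list is safer than a block-list here because a newly added field stays out of the logs until someone decides it is safe to include.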