One of my morning rituals at work is to look at a custom summary view of error logs. This allows me to learn what is happening in the system.
Combined with an active drive to clean up the most frequent errors, this is a great way to learn what a normal day looks like. Anything new has been caused by a change somewhere or by an external failure.
This also gives you a way to estimate how frequently errors are happening compared to the events users are attempting.
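As a rough sketch of the idea (the real summary view is a dashboard; the JSON-lines format, the "level" and "message" fields, and the attempt count here are assumptions, not our actual tooling), it boils down to grouping errors by message and comparing the total against attempted events:

```python
from collections import Counter
import json

def summarise_errors(log_lines, attempted_events):
    # Count error records grouped by message.
    errors = Counter()
    for line in log_lines:
        record = json.loads(line)
        if record.get("level") == "ERROR":
            errors[record["message"]] += 1

    # Overall error rate relative to what users were attempting.
    total = sum(errors.values())
    if attempted_events:
        print(f"{total} errors across {attempted_events} attempts "
              f"({total / attempted_events:.2%})")
    # The most frequent errors are the ones worth cleaning up first.
    for message, count in errors.most_common(10):
        print(f"{count:6d}  {message}")
```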
Typically I ask in one of our developer Slack channels if anyone is aware of the new issue. Sometimes people are already aware of the problem; frequently, further investigation is required.
For context, I work at a medium-sized company with multiple monoliths surrounded by a suite of services. Errors in the logs within my team's scope can be caused by work belonging to my team or, more frequently, by changes made by other teams.
Note this is mostly about errors. I recommend logging an error when something has genuinely gone wrong, and this matters most in the parts of a system where customers are paying for something. Our system has two major halves, quotes and sales, and there are roughly 1000x more quotes than sales. The quote side should log less freely than the sales side, otherwise quote errors will dominate the summary.
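To illustrate the point (this is a minimal sketch using Python's standard logging, with hypothetical handler names rather than our actual code), the split looks something like this:

```python
import logging

logger = logging.getLogger("pricing")

def on_quote_failure(quote_id, error):
    # Quotes outnumber sales ~1000:1, so a failed quote is only a warning;
    # logging it as an error would drown out the problems we actually act on.
    logger.warning("Quote %s failed: %s", quote_id, error)

def on_sale_failure(sale_id, error):
    # A customer is paying here, so this must show up in the error summary.
    logger.error("Sale %s failed: %s", sale_id, error)
```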
Recently I found a new log message stating that manual intervention was required for a process. I had to ask the person who wrote that code exactly what the manual intervention was and who would do it.
I recommend adding a link in the log message to a wiki page with the instructions to fix the problem. The page can start out as a placeholder.
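For example (the wiki URL and function name below are placeholders, not real pages):

```python
import logging

logger = logging.getLogger("refunds")

def flag_for_manual_refund(order_id):
    # The linked page can start life as a stub and grow into a proper runbook.
    logger.error(
        "Refund for order %s needs manual intervention. "
        "Runbook: https://wiki.example.com/runbooks/manual-refund",
        order_id,
    )
```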
I also find that info or warning messages can be a great way to prove that a change has worked. I recently added a scheduled retry in 24 hours for a weird refund scenario (you can’t partially refund a credit card transaction that is less than 24 hours old…). By logging the retries it was possible to see how frequently this problem happened and how many manual interventions were required (in this case: not many, and none).
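Here is a sketch of how that retry and its logging fit together; the Transaction shape and the issue_refund and schedule_task callables are hypothetical stand-ins for our real payment and scheduling code:

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

logger = logging.getLogger("refunds")

@dataclass
class Transaction:
    id: str
    amount: int           # in cents
    created_at: datetime  # timezone-aware

def refund(transaction, amount, issue_refund, schedule_task):
    """Refund `amount`, deferring partial refunds on transactions under 24h old."""
    age = datetime.now(timezone.utc) - transaction.created_at
    if amount < transaction.amount and age < timedelta(hours=24):
        retry_at = transaction.created_at + timedelta(hours=24)
        schedule_task(retry_at, transaction.id, amount)
        # Logging the deferral at info level is what made it possible to see
        # how often this scenario actually happened once the change shipped.
        logger.info("Partial refund of %s for %s deferred until %s",
                    amount, transaction.id, retry_at)
        return
    issue_refund(transaction, amount)
    logger.info("Refunded %s for %s", amount, transaction.id)
```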