These are the changes you make because a system does something that annoys you.
Month: September 2024
Expanding Practices From Personal to Team to Company
I have a personal practice at work of looking at the error logs of our system each day. This leads to actions to make small improvements.
At the start of this year my engineering manager encouraged me to expand this to a team activity. We meet once a week and look at the highest-frequency patterns. These are turned into either tickets for us to fix or requests for other teams. We have halved the volume of errors. The logs are now mostly free of noise and we are clearing real errors, some of which have been in production for years.
The next step is to expand this to the entire company.
The key point is that an observability platform is more effective if everyone is looking at it every day. I aim to inspire curiosity: what has caused this error, and do I need to fix it or ignore it?
The basis for this is a simple dashboard in Datadog. List the error logs for a number of systems. Look at errors in production, grouped by pattern. Add exclusion rules to ignore things that are too expensive to fix now. View over a one-day time window. Save it as a favourite.
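As a rough illustration of what sits behind such a dashboard, here is a minimal sketch that composes a log search query from a list of systems and exclusion rules. The service names, the `production` environment tag and the excluded phrase are made up for the example, and the exact query I use is not shown here; check the result against Datadog's log search syntax before relying on it.

```python
def build_error_query(services, env="production", exclusions=()):
    """Compose a log search query string for error logs.

    Assumes Datadog-style search syntax: key:value terms separated by
    spaces, OR inside parentheses, and a leading "-" to exclude a term.
    """
    service_clause = " OR ".join(services)
    parts = ["status:error", f"env:{env}", f"service:({service_clause})"]
    # Each exclusion rule is a search term negated with a leading "-".
    parts.extend(f"-{term}" for term in exclusions)
    return " ".join(parts)


# Hypothetical systems and one hypothetical "too expensive to fix now" rule.
query = build_error_query(
    ["checkout", "billing"],
    exclusions=['"connection reset by peer"'],
)
print(query)
```

Paste the printed string into the log search box, group by patterns, set the time window to one day and save the view.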
Look at this every day. Keep stats in a spreadsheet, as Datadog has a limited time window.
Analyse it every week.
Fix issues raised.
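For the weekly analysis, the spreadsheet can be exported and summarised with a few lines of code. This is only a sketch: the column names `date` and `error_count` and the numbers in the sample are invented for illustration, not taken from our real stats. It summarises counts that were recorded by hand; it does not replace looking at the dashboard.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Hypothetical CSV export of the spreadsheet: one row per day,
# with the error count typed in by hand after looking at the dashboard.
SAMPLE_CSV = """date,error_count
2024-09-02,120
2024-09-03,110
2024-09-09,80
2024-09-10,70
"""


def weekly_totals(csv_text):
    """Sum the daily error counts per ISO (year, week)."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        year, week, _ = date.fromisoformat(row["date"]).isocalendar()
        totals[(year, week)] += int(row["error_count"])
    return dict(sorted(totals.items()))


totals = weekly_totals(SAMPLE_CSV)
for (year, week), count in totals.items():
    print(f"{year}-W{week:02d}: {count}")
```

A falling weekly total is a simple way to show the team that the fixes are working.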
The first point people raise is to automate this. Resist this urge: looking at it regularly is the point.
Logs are great for this as it is always possible to drill down to find a specific example to look at.
As a side effect the team will become better at writing log messages and using the observability tools. They will learn more about the systems they are supporting.