First look at Rust

I am starting to study Rust as it is used at work on some projects. So far it has been ok, but not overwhelming.

Rust is a semicolon language, which must be a more important split than the static/dynamic or compiled/interpreted split.

This is coming as a shock after Groovy, TypeScript and Elixir.

The Rust compiler messages seem less helpful than I’d expect.

When Log Analysis Goes Too Well

Once upon a time I was working on a system that was deployed on customers' sites. That meant we had no regular access to the logs for the system, which made getting feedback difficult.

One of the developers on the team added an exception logging table which would capture every exception raised by the system along with the full stack trace.
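
The mechanism was roughly a catch-all handler that wrote each exception into a table. A minimal sketch of the idea (the table name, schema and database client are all illustrative, not the original system's):

```typescript
// Sketch of an exception logging table: a catch-all wrapper that records every
// exception with its stack trace. The DbClient interface and table layout are
// invented for this example.
interface DbClient {
  query(sql: string, params: unknown[]): Promise<void>;
}

async function logException(db: DbClient, err: Error, context: string): Promise<void> {
  await db.query(
    "INSERT INTO exception_log (raised_at, context, message, stack_trace) VALUES (?, ?, ?, ?)",
    [new Date().toISOString(), context, err.message, err.stack ?? ""],
  );
}

// Wrap the top level of each request or job so nothing escapes unrecorded.
async function withExceptionLogging(db: DbClient, context: string, work: () => Promise<void>): Promise<void> {
  try {
    await work();
  } catch (err) {
    await logException(db, err instanceof Error ? err : new Error(String(err)), context);
    throw err; // the caller still sees the failure
  }
}
```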

In order to debug a certain error, the customer provided us with a full backup of their DB. In addition to fixing the specific issue, I managed to get 2 days to look at the exceptions.

By looking at the frequencies I was able to put measures in place to handle 99% of these errors. Most of them were transitory problems that the customer had never reported.

When the fixed version of the product was returned to the customer they were happy with it, but could not explain why.
In fact they were so happy that it took 2 years to convince them to take a new version (we had a 6-month release cycle at the time).

Read the Logs

One of my morning rituals at work is to look at a custom summary view of error logs. This allows me to learn what is happening in the system.

Combined with an active drive to clean up the most frequent errors, this is a great way to learn what a normal day looks like. Anything new has been caused by a change somewhere or an external failure.

This also gives you a way to estimate how frequently errors are happening compared to the events users are attempting.
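
As a rough illustration, the summary boils down to a frequency table over the raw error records, plus a ratio against attempted events. A sketch, with the record shape and field names invented for the example:

```typescript
// Sketch of the daily summary: group error records by message, sort by
// frequency, and compare the total against the number of attempted events.
interface ErrorRecord {
  message: string;
  timestamp: string;
}

function summarise(errors: ErrorRecord[], attemptedEvents: number) {
  const counts = new Map<string, number>();
  for (const e of errors) {
    counts.set(e.message, (counts.get(e.message) ?? 0) + 1);
  }
  // Most frequent first, so the top of the table is the first thing you read.
  const table = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  const errorRate = attemptedEvents > 0 ? errors.length / attemptedEvents : 0;
  return { table, errorRate };
}
```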

Typically I ask in one of our developer Slack channels if anyone is aware of the new issue. Sometimes people are already aware of the problem. Frequently further investigation is required.

For context I work in a medium sized company with multiple monoliths surrounded by a suite of services. Errors in the logs within my team's scope can be caused by work belonging to my team or, more frequently, by changes made by other teams.

Note this is mostly about errors. I recommend logging errors when something has gone wrong, and this matters most in the parts of a system where customers are paying for something. Our system splits into two major parts, and there are roughly 1000x as many quotes as sales. The quote side should log less frequently than the sales side, otherwise quote errors will dominate.
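
One way to achieve that split is to sample error logging on the high-volume quote path while always logging on the sales path. This is just a sketch of that option, not how our system actually does it; the logger interface and sample rate are illustrative:

```typescript
// Sketch: always log sales errors, but only log a sample of quote errors so
// the 1000x quote volume does not drown out the sales failures.
interface Logger {
  error(message: string, fields?: Record<string, unknown>): void;
}

function logQuoteError(log: Logger, message: string, sampleRate = 0.01): void {
  if (Math.random() < sampleRate) {
    log.error(message, { sampled: true, sampleRate });
  }
}

function logSalesError(log: Logger, message: string): void {
  log.error(message); // customers are paying here: always log
}
```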

Recently I found a new log message stating that manual intervention was required for a process. I had to ask the person who wrote that code exactly what the manual intervention was and who would do it.

I recommend adding a link in the log message to a wiki page with the instructions to fix the problem. This can start out as a placeholder.
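
For example, something like this (the URL and order id are placeholders):

```typescript
// Sketch: put the wiki/runbook link in the message itself so whoever sees the
// log knows where the fix instructions live. The page can start as a stub.
const orderId = "ORD-12345"; // illustrative
console.error(
  `Manual intervention required for order ${orderId}: refund could not be completed. ` +
    "Instructions: https://wiki.example.internal/runbooks/refund-manual-intervention",
);
```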

I also find that info or warning messages can be a great way to prove that a change has worked. I recently added a scheduled retry in 24 hours for a weird refund scenario (you can’t partially refund a credit card transaction that is less than 24 hours old…). By logging the retries it was possible to see how frequently this problem happened and how many manual interventions were required (in this case not many retries and no manual interventions).
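
A sketch of the shape of that change, assuming an in-process timer stands in for whatever scheduling the real system uses (a persistent job queue is more likely in practice), with the refund call and ids invented:

```typescript
// Sketch: schedule a retry for 24 hours later and log at info level so the
// retries can be counted afterwards. setTimeout stands in for a real scheduler.
const DAY_MS = 24 * 60 * 60 * 1000;

function scheduleRefundRetry(transactionId: string, refund: (id: string) => Promise<void>): void {
  console.info("Refund under 24h old; scheduling retry", {
    transactionId,
    retryAt: new Date(Date.now() + DAY_MS).toISOString(),
  });
  setTimeout(async () => {
    try {
      await refund(transactionId);
      console.info("Scheduled refund retry succeeded", { transactionId });
    } catch (err) {
      console.error("Scheduled refund retry failed, manual intervention needed", {
        transactionId,
        error: String(err),
      });
    }
  }, DAY_MS);
}
```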

Thoughts on scope of bugfixes

When working on a bugfix do you handle all of the edge cases?

Typically if a system is down you need to do what it takes to get the system back up as quickly as possible without making things worse. This can involve leaving rare edge cases as failure messages with logs that will help fix the problem in future.
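
In code that often looks like restoring the common path and leaving the rare edge case as a logged failure. A sketch, with the payment shape and currency check invented:

```typescript
// Sketch: fix the common path, and for the rare edge case emit a failure
// message with enough detail to fix it properly later.
interface Payment {
  id: string;
  currency: string;
  amount: number;
}

function settle(payment: Payment): void {
  if (payment.currency !== "GBP") {
    // Rare edge case: don't block the hotfix on it, but leave a trail.
    console.error("Settlement skipped: unsupported currency, needs follow-up", {
      paymentId: payment.id,
      currency: payment.currency,
    });
    return;
  }
  // ... the normal settlement path restored by the hotfix ...
}
```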

This all depends upon the severity of a failure, the speed of deployment and the capacity to fix the problem in the near future.

Sometimes having a manual workaround for a rare, non-time-critical issue is a better use of time than overengineering a solution for a problem that may never happen.

Recently I have had to work on production issues that cannot be recreated in a test environment (without waiting a day or so to set up the test data).

A related type of problem is the bug that could have multiple causes. You think you have recreated the issue and fixed it, only to find it is still broken in production. Having audit logs at info level can help establish exactly what was attempted. Event sourced systems can recreate what succeeded, but not always the things that fail.
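
A minimal sketch of what I mean by an info-level audit log, with the action and field names invented:

```typescript
// Sketch: record every attempt at info level, whether or not it succeeds, so
// production behaviour can be reconstructed later when a fix doesn't stick.
async function audited<T>(action: string, details: Record<string, unknown>, work: () => Promise<T>): Promise<T> {
  console.info(`${action}.attempted`, details);
  try {
    const result = await work();
    console.info(`${action}.succeeded`, details);
    return result;
  } catch (err) {
    console.error(`${action}.failed`, { ...details, error: String(err) });
    throw err;
  }
}
```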

When logging a failure message always give enough information to locate the error without leaking PII.
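
For example (field names invented): log the identifiers that locate the record, not the personal details themselves.

```typescript
// Sketch: ids let someone with access find the record; emails and names would
// leak PII into the logs.
interface Customer {
  id: string;
  email: string;
  name: string;
}

function logDeliveryFailure(customer: Customer, orderId: string, reason: string): void {
  // Good: enough to locate the error.
  console.error("Delivery confirmation failed", { customerId: customer.id, orderId, reason });
  // Bad: would leak PII.
  // console.error(`Failed to email ${customer.email} (${customer.name})`);
}
```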

Hospitals and Junior Doctor Strikes

I was admitted to hospital during one of the Junior Doctors' strikes. Typically you have a doctor who will refer you to consultants for specialist investigations. It is then up to the doctor to make the decision about what happens next.

Adding a Junior Doctors' strike makes the last part difficult, as it is typically Junior Doctors who make those decisions.

In my case there was a 24-hour period in which no decisions were made. This meant I could not be discharged from the hospital. Eventually I chose to self-discharge rather than spend Christmas in a state of limbo (in addition to the sleep deprivation that an extended hospital stay involves).

If the government does not want to settle the strike then something must happen to improve the decision making process.

Logs: Fix known issues or exclude them from the main view

When supporting a system you frequently need a view of the aggregated errors that the system is generating.
However, there will sometimes be an error in the logs that you just can’t fix, or that is not worth fixing yet.

Exclude these from the main view as they will mask out real problems further down the frequency table.

This is especially important if you are trying to fix the most frequent error each week; if the same error is always at the top, no progress will be made.
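
One simple way to do the exclusion is a list of known-issue patterns filtered out of the main view. A sketch, with the patterns invented:

```typescript
// Sketch: filter known, accepted errors out of the main frequency view so new
// problems are not hidden beneath them.
const knownIssues: RegExp[] = [
  /legacy exporter timed out/i, // tracked, not worth fixing yet
  /third-party postcode lookup unavailable/i, // outside our control
];

function mainView(errorMessages: string[]): string[] {
  return errorMessages.filter((msg) => !knownIssues.some((pattern) => pattern.test(msg)));
}
```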

Logging Errors: Read them regularly

A log message that is an error represents a failure of the system to do something. Typically something can be done to improve matters (a retry, an alert). If you are going to log them, look at the logs on a regular basis.

If a client-side error was caused by a system outside your direct application, it is fine to make another call to store that error.

Give your log messages a useful correlation id. Natural ones are best (the order id being processed), but artificial ones are also useful if they are sent through the system.
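
A sketch of what that looks like, using the order id when there is one and a generated id otherwise (the logging helper is invented):

```typescript
// Sketch: thread one correlation id through every log line for a piece of work.
import { randomUUID } from "node:crypto";

function correlationIdFor(orderId?: string): string {
  return orderId ?? randomUUID(); // natural id when available, artificial otherwise
}

function logStep(correlationId: string, message: string): void {
  console.info(message, { correlationId });
}

// Usage: the same id appears on every message for this piece of work.
const id = correlationIdFor("ORD-98765");
logStep(id, "payment authorised");
logStep(id, "despatch requested");
```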