Read the Logs

One of my morning rituals at work is to look at a custom summary view of error logs. This allows me to learn what is happening in the system.

Combined with an active drive to clean up the most frequent errors, this is a great way to learn what a normal day looks like. Anything new has been caused by a change somewhere or by an external failure.

This also gives you a way to estimate how frequently errors are happening compared to the events users are attempting.
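As a rough sketch of that estimate, assuming you can export a count of errors and a count of attempted events per event type (the event names and numbers here are made up for illustration):

```python
def error_rates(error_counts, attempt_counts):
    """Return errors per attempted event for each event type."""
    rates = {}
    for event, attempts in attempt_counts.items():
        if attempts == 0:
            # Nothing was attempted, so a rate is meaningless.
            continue
        rates[event] = error_counts.get(event, 0) / attempts
    return rates


# Hypothetical daily counts, not real data.
errors = {"quote": 50, "sale": 2}
attempts = {"quote": 100_000, "sale": 100}

rates = error_rates(errors, attempts)
for event, rate in sorted(rates.items()):
    print(f"{event}: {rate:.2%} of attempts failed")
```

The raw counts alone would suggest quotes are the bigger problem; dividing by attempts shows that the smaller number of sale errors may matter more.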

Typically I ask in one of our developer Slack channels if anyone is aware of the new issue. Sometimes people are already aware of the problem. Frequently, further investigation is required.

For context, I work in a medium-sized company with multiple monoliths surrounded by a suite of services. Errors in the logs within my team's scope can be caused by work belonging to my team or, more frequently, by changes made by other teams.

Note this is mostly about errors. I recommend logging errors when something has gone wrong. This is more important in parts of a system where customers are paying for something. Our system splits into two major parts, as there are 1000x as many quotes as sales. The quote system should log less frequently than the sales system, otherwise its messages will dominate.

Recently I found a new log message stating that manual intervention was required for a process. I had to ask the person who wrote that code exactly what the manual intervention was and who would do it.

I recommend adding links in the log message to a page in the wiki with the instructions to fix. This can start out as a placeholder.
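A minimal sketch of what this could look like with Python's standard logging module. The wiki URL, the page name and the helper function are all hypothetical; the point is only that the link travels with the message.

```python
import logging

logger = logging.getLogger("refunds")

# Hypothetical wiki location; replace with wherever your runbooks live.
RUNBOOK_BASE = "https://wiki.example.com/runbooks"


def runbook_message(message, runbook_page):
    """Append a link to the fix instructions to a log message."""
    return f"{message} | runbook: {RUNBOOK_BASE}/{runbook_page}"


# The wiki page can start out as a placeholder and be filled in later.
logger.error(
    runbook_message(
        "Manual intervention required: refund could not be completed",
        "refund-manual-intervention",
    )
)
```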

I also find that info or warning messages can be a great way to prove that a change has worked. I recently added a scheduled retry in 24 hours for a weird refund scenario (you can’t partially refund a credit card transaction that is less than 24 hours old…). By logging the retries it was possible to see how frequently this problem happened and how many manual interventions were required (in this case, not often and none, respectively).

Thoughts on scope of bugfixes

When working on a bugfix do you handle all of the edge cases?

Typically if a system is down you need to do what it takes to get the system back up as quickly as possible without making things worse. This can involve leaving rare edge cases as failure messages with logs that will help fix the problem in future.

This all depends upon the severity of a failure, the speed of deployment and the capacity to fix the problem in the near future.

Sometimes having a manual workaround for a rare, non-time-critical issue is a better use of time than overengineering a solution for a problem that may never happen.

Recently I have had to work on production issues that cannot be recreated in a test environment (without waiting a day or so to set up the test data).

A related type of problem is the bug that could have multiple causes. You think you have recreated the issue and fix it, only to find it is still broken in production. Having audit logs at info level can help establish exactly what was attempted. Event-sourced systems can recreate what succeeded, but not always the things that failed.

When logging a failure message always give enough information to locate the error without leaking PII.
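One way to approach this, sketched under the assumption that failure context is passed around as a dictionary: keep the identifiers that locate the error and redact the fields known to hold PII. The field names below are illustrative, not a complete list.

```python
import logging

logger = logging.getLogger("sales")

# Hypothetical set of keys that hold personal data in this system.
PII_FIELDS = {"email", "name", "card_number", "address"}


def safe_context(context):
    """Return a copy of the context with PII values redacted."""
    return {
        key: "[redacted]" if key in PII_FIELDS else value
        for key, value in context.items()
    }


context = {"order_id": "A1", "step": "capture", "email": "someone@example.com"}

# The order id and step are enough to find the failure; the email is not logged.
logger.error("Payment failed: %s", safe_context(context))
```

An allow-list of safe fields is stricter than this block-list and may be the better choice if new fields are added often.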

Hospitals and Junior Doctor Strikes

I was admitted to hospital during one of the junior doctors' strikes. Typically you have a doctor who will refer you to consultants for specialist investigations. It is then up to the doctor to make the decision about what happens next.

A junior doctors' strike makes the last part difficult, as it is typically junior doctors who make those decisions.

In my case there was a 24-hour period in which no decisions were made. This meant I could not be discharged from the hospital. Eventually I chose to self-discharge rather than spend Christmas in a state of limbo (in addition to the sleep deprivation that an extended hospital stay involves).

If the government does not want to settle the strike then something must happen to improve the decision making process.

Logs: Fix known issues or exclude them from the main view

When supporting a system you frequently need a view of the aggregated errors that the system is generating.
However, there will sometimes be an error in the logs that you just can’t fix, or that is not worth fixing yet.

Exclude these from the main view as they will mask out real problems further down the frequency table.

This is especially important if you are trying to fix the most frequent error each week. If the same error is always at the top, no progress will be made.
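A small sketch of such a view, assuming the error messages have already been pulled out of the logs into a list. The messages and the known-issues set are invented for the example; most log tools can express the same idea as a saved filter.

```python
from collections import Counter

# Hypothetical messages we have decided not to fix yet.
KNOWN_ISSUES = {"Payment provider timeout"}


def frequency_table(messages, known_issues=KNOWN_ISSUES):
    """Split error messages into a main view and an excluded view,
    each sorted from most to least frequent."""
    main = Counter(m for m in messages if m not in known_issues)
    excluded = Counter(m for m in messages if m in known_issues)
    return main.most_common(), excluded.most_common()


messages = (
    ["Payment provider timeout"] * 5
    + ["Null customer id"] * 2
    + ["Refund rejected"]
)

main_view, excluded_view = frequency_table(messages)
print(main_view)      # what to work on this week
print(excluded_view)  # still counted, just not in the way
```

Keeping the excluded counts visible somewhere is deliberate: a known issue that suddenly doubles in frequency is worth noticing.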

Logging Errors: Read them regularly

A log message that is an error is a failure of a system to do something. Typically something can be done to improve matters (a retry, an alert). If you are going to log errors, look at the logs on a regular basis.

If a client-side error was caused by a system outside your direct application, it is fine to make another call to store that error.

Give your log messages a useful correlation id. Natural ones are best (the order id being processed), but artificial ones are also useful as long as they are passed through the whole system.
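As an illustration, here is one way to attach a correlation id to every message using Python's standard logging.LoggerAdapter. The logger name, the log format and the order id are assumptions made for the example.

```python
import io
import logging


def make_logger(stream):
    """Build a logger whose format includes a correlation id."""
    logger = logging.getLogger("orders")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
    )
    logger.addHandler(handler)
    return logger


def for_order(logger, order_id):
    """Wrap the logger so every message carries the order id."""
    return logging.LoggerAdapter(logger, {"correlation_id": order_id})


# A StringIO stream stands in for the real log destination.
stream = io.StringIO()
log = for_order(make_logger(stream), "order-1234")
log.error("Payment capture failed")
print(stream.getvalue(), end="")
```

Searching the logs for the order id then returns every message written while that order was being processed, as long as each service logs the same id.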

Thoughts on Alexa Skills

My employer is having a Hackathon and the team I am on is investigating Alexa skills.

The plan was to use the email address of the Alexa account to match against the email address of the customer we already have. It took us all a little while to get a hello world Alexa app stood up. Alexa has been around long enough that there are now more wrong setup articles than good ones.

The Amazon documentation is lacking in actual code samples.

It does have a full setup including its own GitHub build pipeline and deployment tools. The downside is that this makes it more difficult to integrate with your own build tools. For example, it wants you to push to master on its repo, yet ours has that name restricted, so we need to branch again to push it to GitHub.

The debugging experience is painful. The default logging system is very weak: a simple list of log events that is hard to search.

The simulator is missing key features. You can’t trigger permission requests and there is no way to bypass these. You need permissions to fetch a basic email address, and getting this involves an interesting dance with various other services.

Let’s see if this gets better on Day 2.

Why Loki does not break the Time Travel Rules of Endgame

Professor Hulk gave a long talk on the rules of time travel in Endgame. These were based upon his logical reasoning. It is possible that those limits were enforced by the method of Time Travel used: move to parallel universes and you are not going to have consequences on your own.

However, Loki, using TVA gateways, did manage to move backwards on the single timeline and have a conversation that altered the future.

I suspect that the Hulk only had logical reasoning to define the rules, whereas Loki found out by experience.