Only fix what's broken (but fix that one right)!
I have often been called into projects to investigate performance problems or to optimize a system for performance. What I find is that performance problems can have many causes. Some are design problems, such as using the wrong algorithm or processing items one by one instead of in batches. Others are database problems caused by poor access patterns. Network problems add their share. With the growing complexity of modern distributed systems (think cloud!) the situation is getting even worse: more and more systems and services are combined to provide the business functionality, and each one can be the cause of a performance problem.
Analysing performance problems, however, can be rather difficult. In my experience, performance analysis is rarely prepared for: most of the time, only the regular application logs exist as a means of understanding what happened. That performance problems often occur intermittently does not make the situation any better.
So here is what you can do to prepare for performance problems before they happen, and how to proceed when analysing one:
- Make sure you understand the performance requirements and the system behaviour under load. Create a performance model that helps you understand how the system behaves under increasing load. This can tell you in advance, for example during a roll-out, whether your planned infrastructure is sufficient. It also gives you measurable goals against which to evaluate your performance efforts.
- Instrument your code, for example by adding decent performance logging: on each "relevant" boundary (request entry, remote calls, layer boundaries) write a performance log entry recording how long the request takes. Make sure you can track single requests, or at least a session, across system components: for example, create a request ID and write it into every performance log entry. This way you can identify, follow, and understand what happens in a long-running request - even proactively.
- For those who don't work in a DevOps environment, but maybe for them as well: get to know your operational monitoring tools. This includes JVM profiling, database monitoring and so on. Get to know the people who can (are allowed to) operate them in the production environment. "A day at the operations" can really improve the understanding on both sides and improve communication when a real problem occurs.
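The performance model mentioned above does not have to be elaborate to be useful. A minimal sketch, using Little's Law (concurrent requests = arrival rate × response time) and purely hypothetical numbers, might look like this:

```java
// Sketch of a minimal performance model based on Little's Law:
// N (requests in flight) = X (arrival rate) * R (response time).
// The class and method names here are illustrative, not from any framework.
public class PerformanceModel {

    // Expected number of concurrent requests for a given load.
    static double concurrency(double arrivalRatePerSec, double responseTimeSec) {
        return arrivalRatePerSec * responseTimeSec;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 50 requests/s at 200 ms average response time
        // keep about 10 requests in flight - a first hint for sizing
        // thread pools and connection pools.
        System.out.printf("Concurrent requests in flight: %.1f%n",
                concurrency(50, 0.2));
    }
}
```

Even a back-of-the-envelope model like this gives you a number to compare against what you actually measure under load.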
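The request-ID instrumentation described above can be sketched in a few lines. This is a minimal, framework-free illustration (the names `PerfLog` and `timedCall` are mine, not from any library); in a real system the ID would typically be generated at the request entry point and propagated via a logging context:

```java
import java.util.UUID;

// Minimal sketch of per-boundary performance logging with a request ID.
public class PerfLog {

    // Runs the given work, measures its duration, and produces one log
    // line that always carries the request ID so a single request can be
    // followed across system components.
    static String timedCall(String requestId, String boundary, Runnable work) {
        long start = System.nanoTime();
        work.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return String.format("requestId=%s boundary=%s elapsedMs=%d",
                requestId, boundary, elapsedMs);
    }

    public static void main(String[] args) {
        // One ID per incoming request, reused at every boundary.
        String requestId = UUID.randomUUID().toString();
        System.out.println(timedCall(requestId, "request-entry", () -> {}));
        System.out.println(timedCall(requestId, "db-call", () -> {}));
    }
}
```

With log lines in this key=value shape, grepping for one request ID reconstructs the timeline of a single slow request across all components.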
When a problem occurs:
- Keep calm - try not to jump to conclusions when there is a real performance problem. If you don't know the cause, any random rash action may only waste time, or even make the situation worse. Use the analysis tools you prepared to get a better understanding of the situation, and try to find the root cause.
- Fixing symptoms is OK in the short run, but a root-cause fix has to follow. Otherwise you are just piling up technical debt, and the problem may reappear later, even worse. Make sure that fixing this technical debt is not forgotten. Even if a problem occurs only once and goes away "on its own", it might be a symptom of a bigger, hidden problem.
When the system is "back to normal":
- If procedural errors were identified as (part of) the cause of the performance problem, follow up with a post-mortem discussion in your team. This includes programming errors, which may result in updated programming guidelines or code quality checks (you have those, right?). But it may also result in updated operating procedures for the production team.
- Pay off the technical debt. This includes planning the appropriate improvement work into a normal release. It is sometimes difficult to get such "technical" work into a "functional" release, but you have to make clear that without fixing the root cause, the problem may appear again and cause even more trouble. Remind the stakeholders of the cost of the outage versus the cost of the fix.
I hope this helps you fix your next performance problem!