Observability - The New Focus
I’m changing the focus of this newsletter to be about Observability. If you don’t care about observability after reading this, please unsubscribe. 🙂
The problem
Imagine a company that writes a CRM.
Here’s how developing a new feature might look:
Customers ask for a feature: “Please can I have email notifications when a contact messages me?”
Business gives it a green light. Because they like the idea. “Email notifications will be great! I love it when I get notifications!”. “But will it move the needle?” “Who cares! Email notifications baby!”
Product owner makes a list of tasks for how to approach the work over many months. It looks a lot like a Gantt chart, but it’s not, because it’s in Google Sheets, you see. The product owner has no engineering experience, no domain knowledge and no real understanding of what’s possible. They will plan everything out because building software is just like building a house - you make the plans then build it.
User experience designers create some pretty pictures. They look good. There’s not much thought put into making it interactive. That’s a problem for the engineers. No wireframes. No information architecture. There’s little to no user testing. After all, users make things so messy and designers like staying in Figma - it’s such a cool app. When they’re done designing it, they’re really happy. The designs are copy pasted into cards that will be handed off to engineers.
Engineers break down the list of tasks and guesstimate points. After the first week of tasks is planned, they’re guessing and they know it. The points are nonsense and we’re all going to realise that soon enough.
Engineers take the tickets and work on coding up the solution. The designers love seeing engineers taking their ace Figma design and making it work. The engineers leave 14 comments on the carefully built designs, pointing out problems. The designers push back and eventually the engineers just do what they ask, even though it makes the code more difficult to manage.
When it comes to the functionality, engineers go fast and break things, because that’s what Facebook do. They hammer in the features as fast as their fingers will let them. They love building! They’ve heard something about “clean code” but they don’t have time for that. Plus they can clean it up later.
They work with PRs because they need code reviews and it’s a best practice. Everyone reviews PRs in their own style at their own pace. Some are thorough, others have a shortcut made for “LGTM” because that takes too much time to type. Messages fly around Slack asking for approvals.
All the engineers work independently because that’s how you make teams go much faster. Plus, the engineers are all introverts and who wants to sit in another meeting?
No-one in the business looks at what the engineers are making. After all, the engineers don’t make the effort to talk to them. Ever. The business leaves them to do their super smart coding thing and they’ll review what they’ve done later. Email notifications will be easy - every app has them.
Every engineer has a different coding style. Some people write tests that are barely readable. Others swear by mocking. Others write the odd test that tests nothing. Still others write no tests at all. Missing tests aren’t too much of a problem. They don’t have time to write them now, but they’ll do it later.
When tests fail, engineers spend hours fixing them using the debugger because it’s difficult to understand what’s going on. Engineers who don’t like writing tests think this is ridiculous. What’s the point in being slowed down? All these tests do is add extra baggage, but someone on the Internet told these engineers to write them and so they do. Ah well, groupthink!
The code uses a lot of inheritance because that’s what’s great about object oriented programming. Class names are based on the design pattern the engineer has just learned that weekend so that other engineers can understand the code better.
Three weeks in and it’s clear someone didn’t do their planning correctly because cards that seemed easy and were given 1 point seem to be taking forever. Every card seems to breed three more. The product owner is increasingly stressed that the team is falling behind.
The existing code is tricky to change and they’re behind, but that’s OK - the engineers have a plan - copy and paste existing code then tweak it. Subclassing is a great idea to reuse existing code too.
The engineers have read on the internet that accumulating tech debt is sometimes necessary and the timescale seems impossible, so now is the perfect time to take some shortcuts. They can make it correct later. Even the engineers who are keen on writing tests aren’t doing so now.
When the feature is ready after 2 months, someone in the business who isn’t as smart as all those engineers takes a look. First impressions? It seems a bit weird. The email notifications seem to have overlapping rectangles in Gmail. The font is strange and the colours are garish. The settings screen looked good on paper… but when you use it, the buttons are glitchy. It wasn’t clear on the design, but the settings hide other parts of the page, which looks weird and unintuitive.
The business feeds this back. The product owner is nervous. They tell the business that they’re really behind so they won’t be able to fix most of this. They promise to fix the overlapping rectangles at least, as that’s the most important issue.
When the engineers see bug tickets in the backlog they’re not happy. It feels like they’re going backwards. The first issue to tackle is called “Fix overlapping rectangles” - it has three paragraphs of waffly text in the description. When the engineer tries to reproduce the issue they can’t, and they tell the product owner as much. The card is moved to blocked, waiting on investigation. After 3 days of trying to reproduce it in their local dev environment, the engineers give up.
The product owner goes back to the business with their tail between their legs. “Sorry, we worked on the overlapping rectangles and couldn’t fix it. We’ll fix it later.” Both the PO and the business know that this won’t happen. But they both nod their heads and agree not to rock the boat.
The feature enters QA. After a few days, the testers have found 78 bugs. The testers are delighted. Everyone else is pissed.
The product owner spends 2 days putting all 78 bugs into Jira, complete with their position on an effort / impact matrix and story pointing them all. Jira already has 1425 tickets in the backlog, so those extra 78 floating around in there have lots of friends to keep them company.
Time is getting tight and the business are leaning on them heavily now - this feature is turning out to be very expensive - so they fix the most important 8 bugs and leave the rest - they’ll fix those later. After all, every bit of software has bugs.
None of the 8 bugs have a regression test written for them because they’re getting short on time. Besides, the engineers are pretty sure those have been fixed now. When the product owner points to a bug they’ve come across, the engineer defends their work - “it works on my machine”. The product owner gives up. Bigger fish to fry.
Finally the feature is released. The team merges the last PR and waits 28 minutes for the tests to pass. The moment the commit is released, the team high-fives. They did it! They celebrate their win by posting to Slack and everyone chimes in adding emojis. There are 21 emojis! Wow - this feature release is a huge success. Everyone needs a 2 week holiday. But it’s done.
The team wants to do the cleanup - there are so many outstanding issues. But sadly, the time that was allocated for it never appears. They have 2 days to recover then it’s time for another urgent feature.
Whilst they’re developing this next feature, the team start getting support requests mentioning the email notifications. Seems like some customers are not getting notifications. More concerningly, some customers are being repeatedly spammed every hour. The app really shouldn’t be doing that.
The engineers look into it. The error shows in Sentry - it’s buried under 534 other errors so they didn’t notice it. One engineer tries to figure out what’s going on. The error contains almost no useful information. Just a cryptic message - “Email wasn’t sent” - with a stack trace. The stack trace points back to a try / catch block that swallows the error and logs a generic message.
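To make that concrete, here’s a minimal Ruby sketch of the anti-pattern - the mailer and field names are made up, not from any real app - a rescue that swallows the exception and reports a generic message, next to a version that keeps enough context to actually debug it in production.

# Roughly the story's version: the real exception is discarded, and all
# Sentry ever sees is a generic message with a stack trace pointing here.
def notify_contact_message(user, contact)
  NotificationMailer.contact_message(user, contact).deliver_now
rescue StandardError
  Sentry.capture_message("Email wasn't sent") # the actual error and its context are lost
end

# A version that keeps the context you'd want in production:
def notify_contact_message(user, contact)
  NotificationMailer.contact_message(user, contact).deliver_now
rescue StandardError => e
  Sentry.capture_exception(e, extra: { user_id: user.id, contact_id: contact.id })
  Rails.logger.error("contact_message email failed: #{e.class}: #{e.message} user_id=#{user.id}")
  raise # let the caller or job retry machinery see the failure instead of hiding it
end

Capturing (or re-raising) the original exception with a little context is what turns “Email wasn’t sent” into something you can actually act on.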
The engineers can’t replicate this locally. They’re in trouble and they know it. There are now 132 support requests with exactly the same error. There are over 700 errors in Sentry now. The pressure mounts.
They redeploy the code, taking the try / catch out. The error is hardly more useful. “SSL disconnect - retrying…”. They open New Relic as a last-ditch attempt - no-one really looks in there much - and they go to the logs. Nothing of any use - just a bunch of text statements.
48 hours after the first support request, they’re still no nearer to debugging the problem. They’ve added four log statements that point to some kind of strange timeout with the email platform. The CEO is frantic and hounds the product owner for updates. The engineers are stressed out of their minds. The testers feel bad.
4 days after the initial storm, they figure out the issue. They decide to patch it and move on. They’re already late with the next feature after all and stakeholders aren’t happy. They decide to fix the other Sentry errors later.
None of this is joyful.
Nice rant. What’s the point?
In the story above there are a lot of antipatterns. I’ve never seen all of these at any one workplace - this is deliberately exaggerated. But hopefully you’ve seen enough of these problems to recognise them.
I’ve tried to challenge some of these anti-patterns throughout my career and I’ve not been massively effective.
I’ve tried to educate on the merits of software design. It’s my lifelong passion, but it seems to be an uphill battle to persuade folk that this is something worth investing in. I also find software design to be a very difficult discipline to measure the impact of.
Observability is my new focus
However, I’ve seen that improving observability is somewhere I can have a measurable impact. I’ve learned how to take a Rails app that’s a black box and turn it into an app where you can at least understand some of the basics of what’s going on.
I’ve seen the value this has for other engineers and for the business.
Everything I’ve learned I’ve learned on my own. As far as I’m aware there’s next to zero information on the web for how to do this in Rails properly. And there’s almost zero information for other frameworks too.
The articles I’ve found about structured logging are utterly useless. They just about explain what it is, give one example in Java then treat the problem as “fixed”. Hilarious.
Other articles on observability are also useless - they talk about the three pillars (if I read another generic explanation of logs, traces and metrics, I will actually go mad) and “why observing in production is really important” but give zero real-life examples. No useful advice. No details.
How do I understand my background jobs?
How do I understand the errors in my Sentry?
How do I track user behaviour?
What conventions should I use in my structured logs?
“Oh, just install New Relic / Datadog / Prometheus.”
OK… and then what?
It’s like asking “How should I invest my money?” and being told “Well, there’s stocks, bonds and Crypto. They’re volatile in different ways. Make sure you save enough for retirement. Best of luck!”
There’s now an excellent book - Observability Engineering: Achieving Production Excellence - by Charity Majors, Liz Fong-Jones and George Miranda. This seems like a significant ray of light at the end of this very dark, very depressing tunnel of black-box footling around, blindfolded by a lack of accurate data… or any data at all.
I want to be crystal clear - I don’t see myself as any kind of expert. I’m an expert in observability in the same way as a 12 year old is an adult. Compared to a toddler, maybe. Compared to a real adult? Not a chance.
The Rails apps I work with are, after 2 years of part-time observability improvements, at 40% of where I want them to be.
I’m obsessed with this topic and I’m always pushing the limit of what I know to deliver more value in observing an app in production.
I’m on a journey to learn this too. Come along for the ride!
I’ll share all my learnings here
The learnings I’ll share with you in this newsletter are a tiny fraction of what I think the standard for observability should be for every app that has any significant workload in production.
I’ve talked to some engineers who seem to be quite happy with what they call “the basics” in place, which really means “we installed the standard tools, added lograge and I suppose it’s good enough for now”, which really translates to “we don’t have a clue how to debug anything in production, so we just create a ticket to add more logging and the bug gets punted off. Plus, we’re onto the next feature now”.
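For context, “added lograge” usually amounts to something like this - a rough sketch based on the typical config from the lograge README, with the custom fields as placeholder examples rather than a recommendation:

# config/environments/production.rb - roughly "the basics"
Rails.application.configure do
  config.lograge.enabled = true
  config.lograge.formatter = Lograge::Formatters::Json.new # one JSON line per request
  config.lograge.custom_options = lambda do |event|
    # bolt a couple of extra fields onto each request log line
    { time: Time.now, params: event.payload[:params].except("controller", "action") }
  end
end

That’s genuinely useful, but it only covers controller requests - it says nothing about background jobs, external services, or why that email never went out.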
I’m not sure if it’s that engineers don’t care that users are having a terrible time, don’t have bandwidth to take on more work or just don’t even have headspace to think about it.
In any case, the chaos caused by this complete lack of interest in how the app is actually working in production is incalculable. Apps crash, users lose data, the app randomly freezes on a Tuesday, DDoSes, hackers, scams, overage bills, huge P95s - and on and on.
At one point I thought I wanted to build features. Now I know different. I want to help you understand the features you already have.
Please reach out if you have any questions or comments.
Thanks for reading. Onwards and upwards!
PS - A HUGE thank you to Irina Stanescu, who writes The Caring Techie, for inspiring me to start writing again. You rock.