Critical Lessons From Fixing Bugs In Production For Years

I have two more days on my current job and I already know I’ll be missing it as much as I can’t wait to get out. Why? Simple, because it is unlikely that I’ll ever have this kind of personal freedom and hectic, end-to-end responsibility for a system again. It’s a dinosaur concept that is rightfully getting phased out in the tech world at large.

It is also unlikely that I’ll be working on a system that requires constant bugfixes in the production environment. The things that nightmares are made of for others were my daily work — and it was fun, quite so.

So here is a little eulogy to a job I know I’ll miss:

Critical bugs are often impossible to fix in test

My company can’t be the only one that works on a weekly, maybe even monthly database dump from production to test. In fact I know that there are companies out there where the developers think “having a test system would be nice” even in 2021.

This weekly dump basically means that there is exactly one half-day during the week where my test system matches the state of production. Any data-related error that comes up and needs to be fixed ASAP therefore usually has to be fixed in production.

You need to understand the whole system front to back

In fact, you will develop a deep understanding of the whole system and structure front to back, but also top to bottom. You’ll learn the triggers and the output, you’ll know how the underlying database tables are structured and how to manipulate them to fix the data issues.

You’ll learn the most common sources of errors and might even be able to fix some of them, but usually these errors result from imports, formatting, character-encoding mismatches (text that arrives in something other than the UTF-8 you expected) and uncommon characters, all of them generated by sources outside your control. So you might set up automated database jobs that clean this data or run workarounds, you’ll implement cleaning solutions in your application, or you might just set up an automated email to notify anyone who has the power to fix things.
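A minimal sketch of what such a cleaning step might look like in Python. The function name, the cp1252 fallback and the decision to strip control characters are assumptions for illustration; the real rules depend entirely on what your import sources send you.

```python
import unicodedata


def clean_imported_text(raw: bytes, fallback_encoding: str = "cp1252") -> str:
    """Decode imported bytes and normalise them before they reach the database.

    Assumption: sources send either UTF-8 or a legacy single-byte code page.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the legacy encoding and mark
        # undecodable bytes with U+FFFD instead of failing the import.
        text = raw.decode(fallback_encoding, errors="replace")

    # Compose characters so "e" + combining accent and "é" compare equal.
    text = unicodedata.normalize("NFC", text)

    # Drop control characters (category Cc) but keep tabs and newlines.
    text = "".join(
        ch for ch in text if ch in "\t\n" or unicodedata.category(ch) != "Cc"
    )
    return text.strip()
```

Running every imported field through one function like this keeps the workaround in a single place, which makes it easier to adjust once you learn what else the upstream source gets wrong.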

This deep understanding is fun and powerful, and it goes much deeper than programming usually does for beginners. I was thrown into this world with little warning after the previous developer was fired and I, still an apprentice, was the only one left to pick up the shovel. Nobody else wanted a piece of this pie. This is not normal for beginners, and as much as I struggled in the beginning, I am just as thankful for the chance in retrospect. You learn so much in so little time when a bug in your system costs a thousand bucks for every day it goes unfixed.

Most programming these days tries to abstract everything as much as possible. It is common to hide database access completely behind a layer of data mappings, object-relational mappers (ORMs), whatever you want to call it. It is common for developers to hate writing SQL code as much as I love it.

As long as everything works well that approach is perfectly fine, no need to reinvent the wheel. The problems start as soon as something out of your immediate control goes wrong — and when everything is out of your control you will quickly get eaten by unsolvable problems.

The best program would be just a chain of imported packages and standardised function calls — but good luck fixing something like that. You’ll likely work with version pinning 100% of the time and then you constantly need to update and check things.

You can control damage by doing everything twice

You know what sucks? Updating a row in a database and then realizing in an oh-shit moment that you updated the wrong fields, with no way back. You know what’s worse? Realizing that you just did that for a hundred rows, or a thousand.

That’s why it is so useful to run every update twice, once in a limited quantity that allows you to verify the results and then again at scale.

I often had help here from non-IT staff. They needed something fixed, I applied the fix to one customer record, they verified the result, and I pushed the button again without the limitation.
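Here is a small sketch of that two-pass routine, using Python's built-in sqlite3 module as a stand-in for whatever database you actually run. The `customers` table, the whitespace problem and the `guarded_update` helper are all hypothetical; the point is only the pattern of checking the affected row count before committing.

```python
import sqlite3


def guarded_update(conn, sql, params, expected_rows):
    """Run one UPDATE and commit only if it touched exactly expected_rows rows."""
    cur = conn.execute(sql, params)
    if cur.rowcount != expected_rows:
        conn.rollback()  # surprise: undo the change and investigate first
        return False
    conn.commit()
    return True


def make_demo_db():
    """Hypothetical in-memory table standing in for real customer data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO customers (id, email) VALUES (?, ?)",
        [(1, " a@example.com "), (2, " b@example.com"), (3, "c@example.com ")],
    )
    conn.commit()
    return conn


conn = make_demo_db()
fix = "UPDATE customers SET email = TRIM(email) WHERE "

# Pass 1: a single record that a colleague can verify by hand.
guarded_update(conn, fix + "id = ?", (1,), expected_rows=1)

# Pass 2: the same fix without the limitation, after counting what is left.
remaining = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE email != TRIM(email)"
).fetchone()[0]
guarded_update(conn, fix + "email != TRIM(email)", (), expected_rows=remaining)
```

The rollback only helps while the transaction is still open, so the verification has to happen before the commit. Once the change is committed, you are back to restoring from a backup.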

Not everyone is made for a job like this

I loved this job, the constant bush fires to quench, the fancy building with the rotting wooden beams in the basement.

Others hated it. In my time at this company I was usually responsible for everything while people came and went, and for the last two years I was completely alone. I lived through five developers in as many years, not including the one who got fired before me, who had simply given up and waited for the day when others realized it was time to let him go.

And honestly I am now at the point where I’m glad I get to leave. I like the idea of not working nights and weekends and actually going on vacations rather than day trips. I’ll also miss that fiery redhead of a codebase; it’s hard to go back from that to stable programming routines and highly standardised workflows.

Programming has little to do with coding

This is something that gets talked about a lot, but it is true and should be talked about even more. Sure, in order to build and maintain programs you need to write code — but that is maybe a tenth of the work that goes into it.

Towards the end I sat in meetings more than I wrote code. I spent more time on database work than on committing changes to our build pipeline, and more time on project work than on bug fixes.

The actual solution to a bug is rarely complicated; more often than not it boils down to a single line of code. We once fixed a massive performance issue by changing the order of a LINQ statement.
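To illustrate how much the order of operations can matter, here is a hypothetical Python analogue (not the original LINQ statement, and the `orders` data is made up): both functions return the same list, but the second one sorts only the rows that survive the filter.

```python
import random

random.seed(1)
# Made-up data: 100,000 orders, of which roughly 1% are still open.
orders = [
    {"id": i, "total": random.random(), "open": random.random() < 0.01}
    for i in range(100_000)
]


def sort_then_filter(rows):
    """Sorts all rows, then throws most of them away."""
    return [r for r in sorted(rows, key=lambda r: r["total"]) if r["open"]]


def filter_then_sort(rows):
    """Keeps only the open rows first, then sorts the small remainder."""
    return sorted((r for r in rows if r["open"]), key=lambda r: r["total"])
```

The change is a single line, but with a selective filter the expensive step now runs on a fraction of the data. Finding out that this line is the culprit is the part that takes the time.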

Finding those bugs takes way more time. In fact, I would say that I spent more time debugging than building in the past eight years. You develop routines, you understand the system flow, and you save time with cleverly placed breakpoints. So much goes into running a system that you suddenly find yourself debugging server issues on a virtual machine, wondering how on earth this is your responsibility. You communicate, get help, give help; you are nothing without the social network that develops over time.

Silo knowledge is dangerous and needs to be avoided at all cost

I say this as a direct beneficiary of being the only one responsible for the system and thus having near-infinite job security. It also sucks to be the single point of failure, and for the company it sucks to have one. The company can count itself lucky that I am leaving on good terms. I could have been gone from one day to the next instead of putting in five more months to even attempt a proper hand-over of everything I know.

And it’s not like this surprises anyone, but it still happens by default even with people who are open to sharing. We all have our own jobs and specialize in them to a degree, but the extreme needs to be avoided. Grain stored in a silo can quickly rot and make the whole content useless. A developer can crash on their motorcycle, become sick or be otherwise unavailable to work, and then suddenly everyone realizes that it might have been smart to have a replacement ready.

And as a developer I believe the worst thing you can do is to sit on your knowledge in hopes of job security and be all secretive. If you hold passwords and knowledge hostage I consider you scum, simple as that. Transparency is a powerful shield and sword, and you can achieve the same kind of job security by sharing as by hiding, while everyone likes you better for it.

Just because you skip the test system does not mean you get to skip good coding practices

A lot can be written about proper coding practices, but to me there are a handful that make sense on a practical level and not just on paper:

  • Automate build pipelines and skip the hassle of manual deploys
  • Make incremental commits so you always know which change broke what instead of guesswork on a larger release
  • Code reviews are nice if you can get someone to do them
  • Spend time on making your code clean rather than just getting it to work
  • Do all the small things, such as proper variable naming schemes and code styles. A really nice tool for this is StyleCop, which prevents you from building unless your code looks good. It’s annoying, but immensely helpful.
  • Set up daily monitoring and error alerts so you actually know when an error arises.
  • Extract controlling variables and settings into a config file (and add that to version control!) — that way you can change the system without needing a full redeploy. This can save you an hour of time on each change if you go through the usual pull-request-test-production deploy cycle.
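As a rough sketch of the last point, this is one way a config file could be loaded in Python. The setting names, the JSON format and the file name `settings.json` are assumptions for illustration; the idea is simply that defaults live in code while overrides live in a file you can edit without a redeploy.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical defaults; the real settings depend on your system.
DEFAULTS = {
    "import_batch_size": 500,
    "alert_email": "ops@example.com",
    "retry_limit": 3,
}


def load_settings(path):
    """Return DEFAULTS overridden by whatever the JSON file at path contains."""
    settings = dict(DEFAULTS)
    config_file = Path(path)
    if config_file.exists():
        overrides = json.loads(config_file.read_text(encoding="utf-8"))
        unknown = set(overrides) - set(DEFAULTS)
        if unknown:
            # A typo in the file should fail loudly, not be silently ignored.
            raise KeyError(f"unknown settings: {sorted(unknown)}")
        settings.update(overrides)
    return settings


# Usage: write a small override file and load it.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "settings.json"
    config_path.write_text('{"import_batch_size": 50}', encoding="utf-8")
    settings = load_settings(config_path)
```

Rejecting unknown keys is a design choice: when you change settings directly in production, a misspelled key that does nothing is exactly the kind of bug you don't want to chase.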

If you keep track of those you’ll produce better code faster, and everyone will profit, most of all yourself when you come back in six months to fix your own code.

Takeaway: Fixing bugs in production is sometimes the only choice

My situation is not uncommon, only a bit more pronounced thanks to the customer-facing nature of the system, where errors are immediately noticeable and thus need to be fixed ASAP.

Any kind of data-related issue usually needs to be fixed directly on the production servers or left lying around until the next database dump. Now obviously you could work with daily dumps if that is feasible, but likely it would create overhead of its own and also create a whole host of new problems due to a lack of consistency.

While I would have found daily dumps useful, others would have hated them, as their work spanned multiple days and daily dumps would have required them to save and re-run their scripts each morning. Data consistency and data freshness are both important, and a balance point that suits everyone is impossible to find.
