Refining Craftsmanship through the Fire of Production Support

calebwoods
Published in RoleModel Software
5 min read · Aug 22, 2017

At RoleModel Software, we’ve always been about Software Craftsmanship and training the next generation of Software Craftsmen. Over the years we’ve developed several tools for measuring the skill level of a developer on the path to becoming a Craftsman.

One of those tools is our Craftsmanship Level matrix, which helps identify the areas in which a person needs to grow in order to take on the responsibilities of later levels: first delivering on simple tasks, then breaking down complex problems, and eventually teaching others and leading a team.

This has been a helpful tool, but additional heuristics, such as project and stack diversity, also contribute to identifying someone as a Craftsman who can lead a team on virtually any project they are given. We’ve recently enumerated some of these heuristics on our website:

RoleModel Craftsmanship Heuristics

Time Builds Learning

Providing production support on systems I’ve built has been transformative in my personal Software Craftsmanship journey. Supporting systems I built and pushed to Production gave me the:

  • chance to see libraries I used go out of support
  • “joy” of framework/language upgrades
  • opportunity to troubleshoot many common performance bottlenecks that don’t appear in development or staging environments

Knowing that real users depend on the system has changed my perspective quickly and concentrated my learning.

To Fork or Not to Fork

Those experiences have made me think differently about using a fork of a now-unsupported library that solves a challenge in a project versus spending a little more time to write the functions from scratch. If you’re using a forked or unsupported version, it will likely cost you time during the next major upgrade of the system.

Production support experience may also change how you evaluate a library before bringing it in. Who is publishing this open source library? Does it have backing in the community? Is it well built and covered by tests?

Even when you’ve determined it’s a good idea to bring in a new library, you will also need to consider what refactoring or other cleanup needs to happen first. For example, if you replace your grid system with a tool like Bootstrap, you’ll want to think about converting your whole application and removing the old dead code. At the time you are bringing in the new tool, you have the most context about how the two approaches differ. Refactor thoroughly and your future self will thank you.

Avoid Introducing Incidental Complexity

Another principle I’ve learned is that the less code (and the fewer moving parts) you have in a system, the easier it is going to be to change and upgrade in the future. This leads you to ask questions like: “Does this app really need a JavaScript frontend or would static, server-rendered pages be enough?”

Starting with a more rudimentary solution often allows you to solve the end-to-end problem faster and then iterate on the parts that still need improvement.

Seek to Understand the Full Environment

Supporting a system well also means learning about the complete “stack” or environment in which you are deploying: not just the language stack, but the server and network infrastructure as well.

Here’s another example of a time when production support enhanced my Software Craftsmanship journey. I deployed a new system with an unfortunate bug that caused the system to hang on access a few hours after the deployment. However, a refresh request went through quickly. Unfortunately, this wasn’t uncovered until we were trying to onboard the first customer on the system.

I discovered there was a firewall sitting between our application and the database servers. When the firewall detected no activity on an open connection for 15 minutes, it severed the connection with an approach that turned the connection into a packet black hole. Any packets sent through the connection were swallowed and never returned.

Our application didn’t know the connection had been closed and the database connections became unusable. The application eventually killed the connections because they were hanging.

Sending a keep-alive packet through the connection kept it active, so the firewall left it open. It took quite a while to uncover this issue, but it cemented the importance of exploring variables outside the system and the value of systems thinking to reduce the number of those variables whenever practical.
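As a rough illustration of the keep-alive idea (not necessarily the mechanism used in the fix above), here is a minimal Python sketch that turns on TCP keep-alive for a socket. The function name `enable_keepalive` and the timing values are assumptions made for the example; the only number taken from the story is the 15-minute firewall timeout, which the idle time needs to stay well under. In practice, many database drivers and connection pools expose their own keep-alive or connection-validation settings, which would be the first place to look.

```python
import socket


def enable_keepalive(sock, idle_seconds=300, interval_seconds=60, max_probes=5):
    """Ask the OS to send TCP keep-alive probes on an otherwise idle socket.

    The defaults are illustrative: the first probe goes out after 5 minutes
    of silence, comfortably below a 15-minute firewall idle timeout.
    """
    # Portable switch that turns keep-alive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The tuning knobs below are platform-specific, so each one is only
    # applied when the running platform exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between probes once probing has started.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the OS reports the connection dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, max_probes)
    return sock
```

A side benefit of this approach is that when probes go unanswered, the operating system reports the connection as dead instead of letting requests hang on it indefinitely.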

Performance is Measured at Scale

In development environments, there is often a very small amount of data in the system. We tend to skip thinking about efficiency when we are querying data from the database or processing some input. However, when the app we are supporting reaches millions of rows in the table we are querying, we quickly see the consequences of our approach. Sometimes deferring that optimization is the right tradeoff to make at the time, but part of the tradeoff is embedding production monitoring to reduce the risk of being surprised by a performance issue as the system grows.
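One lightweight form of that monitoring is to time the operations you suspect will grow with the data and log a warning when they cross a threshold. The sketch below is a minimal Python example of that idea; the logger name `slow_operations`, the helper `warn_if_slow`, and the half-second default threshold are all hypothetical choices for illustration, and a real project might rely on an APM tool or the database’s own slow-query log instead.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name; route it wherever production logs already go.
logger = logging.getLogger("slow_operations")


@contextmanager
def warn_if_slow(label, threshold_seconds=0.5, clock=time.perf_counter):
    """Log a warning when the wrapped block takes at least threshold_seconds.

    `clock` is injectable so the timing logic can be tested without sleeping.
    """
    start = clock()
    try:
        yield
    finally:
        # Runs even if the block raised, so slow failures are recorded too.
        elapsed = clock() - start
        if elapsed >= threshold_seconds:
            logger.warning(
                "%s took %.3fs (threshold %.3fs)",
                label, elapsed, threshold_seconds,
            )


# Example usage, assuming some function run_query() exists in your app:
#
#     with warn_if_slow("load recent orders"):
#         rows = run_query()
```

Wrapping a query this way costs almost nothing while the table is small, and it gives an early signal once the same query starts slowing down on production-sized data.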

Always be Learning

Take time to retrospect the challenges that you solve in production systems. Ask yourself:

  1. Knowing what I know now, would I approach the problem differently?
  2. What was missing in our monitoring or logging to catch this issue?
  3. Is this an anomaly, and if so, is it still useful for my learning?

When you find something concrete, share what you’ve learned. Our team uses a #til (Today I Learned) Slack channel to capture these little nuggets of learning. It might also be helpful to capture these learnings in the project’s README for future developers.

Measure Supportability

The RoleModel team wanted to build a shared understanding of what makes a project easier to support. We built a Support Scorecard and we use it on every project.

RoleModel Software Support Scorecard

We actively report on these metrics to help us learn from the collective production support experiences of the team and build those best practices into future projects. The higher the supportability score, the less likely it is that a surprise issue will come up and the easier it will be to onboard someone new onto the project.

What important lessons have you learned supporting production systems? Please leave a comment on this post so others can benefit from our experiences!
