Embrace Your Inner Engineer

Alex Aitken
Published in codeburst · Oct 23, 2020

When I first started at Traveloka, I observed some good engineering practices but also saw practices that could be improved. While 2020 has thrown some curveballs into the mix, it has also created a need for innovation, and that challenge has been key to figuring out how to empower engineers to work productively from home.

In terms of what I felt could be improved, one of the first things I noticed was a lack of centralized documentation. Without comprehensive documentation, how could we onboard an engineer remotely? In such a situation, we don't have the luxury of a whiteboard to show them how things work, and we don't have the ad-hoc time to ask questions and jump into pair programming. So I introduced the RFC process. I wanted a way to centralize everything, so I suggested that we use the company wiki. To get things started, I wrote the first RFC myself and introduced the format. Here's a sneak peek:

# RFC-NNN - [short title of solved problem and solution]

* Status: [proposed | rejected | accepted | deprecated | … | superseded by [RFC-005](link to RFC)] <!-- optional -->
* Deciders: [list everyone involved in the decision] <!-- optional -->
* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->

Technical Story: [description | ticket/issue URL] <!-- optional -->

## Context and Problem Statement

[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]

## Decision Drivers <!-- optional -->

* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->

## Considered Options

* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->

## Decision Outcome

Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].

### Positive Consequences <!-- optional -->

* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* …

### Negative Consequences <!-- optional -->

* [e.g., compromising quality attribute, follow-up decisions required, …]
* …

## Pros and Cons of the Options <!-- optional -->

### [option 1]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 2]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 3]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

## Links <!-- optional -->

* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->

To be clear, I didn't come up with the format above; I took it from here. It's a nice, concise format that captures the pros and cons of each considered option along with the outcome. When writing an RFC, you should present multiple approaches to the problem, and the team will collectively agree on the way forward.
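
To make it concrete, here is a hypothetical, heavily abbreviated example of how a filled-in RFC might start. The service and options are made up purely for illustration:

# RFC-012 - Choose a queue for reward-event processing

* Status: accepted
* Deciders: Rewards backend engineering team
* Date: 2020-10-01

## Considered Options

* A managed message broker
* A database-backed job table
* Synchronous processing (do nothing new)

## Decision Outcome

Chosen option: "A managed message broker", because it handles our peak throughput with the least operational overhead for the team.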

The Pros of a Post-Mortem

My RFC suggestion is far from perfect, and there have been plenty of comments and suggestions on the process. For example, a common complaint is that the wiki is not interactive enough and that Google Docs offers a better collaborative environment. We'll work together to iterate toward a better solution.

Another problem we faced was that we weren't taking care of our services. We were having incidents but weren't really following up with our post-mortem process. As a result, we saw repeat incidents without any clear action items. One of the first things I did after an incident was host a post-mortem retrospective, where I tried to facilitate a blameless learning process: how could we improve and learn from our mistakes, could we do better in the future, and which processes had failed?

By holding these retrospectives, we made sure that the action items from each post-mortem made sense and that we actually investigated the root cause of each incident. This solidified our knowledge and spread it beyond the engineering team to our product, quality, and business teams.

One thing that did come out of these retrospectives is that engineers did not understand why we needed post-mortems. Why should we go over what happened in production? Why create all this documentation? A common reaction was that engineers felt the blame was put solely on them, especially as the business started to look into these post-mortems and senior technology stakeholders took an interest in the outcomes. To answer the why, we had to explain that understanding failure is a fundamental part of learning and improving. By reviewing where we went wrong, we could figure out what to do to prevent similar issues in the future. If you never share those concurrency problems, how will others know to review their systems and stay vigilant about issues that might one day appear on their side?
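
To give an idea of the shape, a minimal blameless post-mortem write-up might look something like the outline below. This is a simplified illustration rather than our exact template:

# Post-Mortem: [short incident title] - [date]

* Impact: [who and what was affected, and for how long]
* Detection: [how we found out - alert, customer report, …]
* Timeline: [detection, escalation, mitigation, resolution, with timestamps]
* Root cause: [what actually failed, described without blame]
* What went well / what went poorly
* Action items: [each with an owner, a due date, and a tracking ticket]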

Beyond the Post-Mortem

Google has an impressive SRE culture, and — from that — an impressive post-mortem culture. Amazon also has great attention to detail regarding post-mortems (from what I’ve heard). While we may not be Google or Amazon, we owe it to ourselves to embrace our inner engineers, figure out the root causes, and learn from our mistakes.

Something we recently started to improve our technical operations is a bi-weekly technical operations sync. For now it's only for my teams, to make sure they are monitoring the right things and keeping track of achievements and learnings.

I was actually inspired by Sebastian, who told us how he runs his teams and keeps them focused. While his format and methods are slightly different, I agree with many of the practices his teams follow. It helps keep teams operationally focused and prevents them from losing sight of how our services are running.

The meeting agenda is as follows:

Hi team,

What: To help us become more technical operations focused, I’ve created this bi-weekly meeting for just the teams under me (to begin with). What we will be focusing on is how we can improve our operational excellence. This includes things like dashboards, alerting, automation, and processes.

Who: You may nominate someone from your team to take your place each session and report back to you.

Agenda:

Wins: This is where we may talk about improvements that we have made over the last two weeks.

* Did our availability go up?
* Did we create a new alert?
* Did we change a process within our team?

Retrospective: This is where we can all learn from our failures.

* Were there any incidents over the last two weeks?
* What were the MTTD (mean time to detect) and MTTR (mean time to resolve) of the incidents?
* What were the learnings from the incidents?

Deep dive: Each session we will deep dive into a team's technical operations. I will do a round-robin of each subdomain/team to begin with, and then we will randomize it. Be prepared.

* What are your dashboards?
* What are you measuring?
* How do these dashboards help you solve incidents?
* What are your key metrics?
* How do you run the on-call process? Is there a playbook?
* What logs do you turn to? How noisy are they?
* What alerts do you have?
* How much does on-call impact your work?
* Are you often paged outside of work hours?

MoM (minutes of meeting): Link to Google Docs

Thanks,
Alex.
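
If your team hasn't tracked numbers like availability, MTTD, and MTTR before, they're cheap to compute once incidents are logged with timestamps. Here's a rough sketch of one common way to calculate them; the incident records and the two-week window below are made up for illustration:

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records: when the problem started, when we detected it,
# and when it was resolved. In practice these timestamps come from your incident tracker.
incidents = [
    {"started": datetime(2020, 10, 5, 9, 0),
     "detected": datetime(2020, 10, 5, 9, 12),
     "resolved": datetime(2020, 10, 5, 10, 3)},
    {"started": datetime(2020, 10, 14, 22, 40),
     "detected": datetime(2020, 10, 14, 22, 41),
     "resolved": datetime(2020, 10, 14, 23, 5)},
]

# MTTD: mean time from the start of an incident to its detection (in minutes).
mttd = mean((i["detected"] - i["started"]).total_seconds() for i in incidents) / 60

# MTTR: mean time from detection to resolution (in minutes).
mttr = mean((i["resolved"] - i["detected"]).total_seconds() for i in incidents) / 60

# Availability over the reporting window: 1 minus total downtime divided by total time.
window = timedelta(days=14)
downtime = sum((i["resolved"] - i["started"] for i in incidents), timedelta())
availability = 1 - downtime / window

print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min, availability: {availability:.3%}")
```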

I want to give a shout out to all my teams. Since we adopted this meeting format, we've been looking at how we measure our availability, how we monitor our services, and why we still have alerts that were set up two years ago. Backend engineers are beginning to understand what to look for on the frontend and where to find it, and vice versa. We're creating lines of communication and thought that did not exist a few months ago.

Lastly, I want to give you an overview of how we handle our backend on-call engineering handover during this pandemic and period of remote work. We noticed that engineers are often given ad-hoc tasks by product, customer support, or other teams (especially if you are an enabler team that other teams depend on). What I set out to do was create a brief on-call log. The idea actually came from Anshul, who introduced it for some of the other teams in the fintech domain, but I didn't want to burden my team with a formal process.

In my Rewards backend engineering team, we have a weekly on-call round-robin schedule. This ensures that ad-hoc tasks don't overburden the engineers and that they also have time to investigate and fix critical issues. The on-call log is a really informal document that lists what each engineer handled during their week on call. For example, did you get four ad-hoc tasks from product admins to inject or query some data? The more we can see these patterns, the more we can look at automating them and justifying the opportunity cost to product. We don't yet have a formal handover or review process for the logs, but I imagine that this will take place once a month when we have some more data to go over with the team.
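
To show what I mean by spotting patterns, here's a rough sketch of the kind of tally you could run once the log has a few weeks of entries. The log format and categories are invented for illustration; ours really is just an informal document:

```python
from collections import Counter

# Hypothetical on-call log entries: (week, category, short description).
log = [
    ("2020-W40", "data-injection", "Inject promo codes for campaign X"),
    ("2020-W40", "data-query", "Pull redemption counts for product"),
    ("2020-W41", "data-injection", "Inject promo codes for campaign Y"),
    ("2020-W41", "incident-follow-up", "Investigate duplicate reward grants"),
    ("2020-W42", "data-injection", "Inject promo codes for campaign Z"),
]

# Count how often each category of ad-hoc work shows up.
counts = Counter(category for _, category, _ in log)

# Anything that keeps recurring is a candidate for automation,
# e.g. a self-service admin tool instead of manual data injection.
for category, count in counts.most_common():
    print(f"{category}: {count} task(s)")
```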

Conclusion

Engineering is so much more than programming. It's about finding the little things that we can improve upon and keeping an eye on our systems. It's about technical and operational excellence in what we do. It's about learning and sharing that knowledge with our team. By choosing to practice it, we bring out the inner engineer in all of us.

Originally published at https://www.alexaitken.nz on October 23, 2020.
