Modern testing for modern stacks

We have gotten into the habit of thinking deeper about one topic on a weekly basis. We pick topics based on anything interesting we read - so the topics can range from 'how to express the value of testing' to 'Dieter Rams' design principles' to 'effective remote work habits'. Employees are guided to spend no more than one hour researching the topic online. The emphasis is on coming up with their own ideas and interpretations. We then meet as a group to exchange ideas. I love this habit and consider it one of the more unique benefits you will enjoy at Qxf2.

Topic: Blame and train response to failure

04-Jul-2017

This article is a nice explanation of the differences between a brittle system and a resilient system and thoughts about blame-and-train.

References:
https://www.snafucatchers.com/single-post/2017/06/04/BorkedTheDatabaseCase

Our thoughts

Indira

I agree with the many reddit commenters who said that leaving a production system exposed to a developer, and with a poorly written guide at that, will obviously lead to this kind of situation. Blaming a person for an accident like this may not be the correct approach, and it will not reduce the brittleness. In this case, I feel that there are many loopholes in the infrastructure itself, such as the developer having the production DB passwords in the dev guide. Even if developers have access, it should be read-only access. Production data should be secured and no one on the dev team should have direct production access. Only people who need it should be able to access it. I agree that blame-and-train is the most common response we get, but this may not always be the case. There could be examples where the employee is at fault but the results are not severe. Mistakes happen and we learn from them. The reaction to such mistakes is important. The focus should be on fixing the problem, not blaming the individual. The entire system needs to be analysed before you take any step. I agree with the article's productive reactions to failure; these techniques will help us analyse failures more deeply.

Avinash

Many IT organisations are prone to such failures. I have experienced this situation a couple of times in my career. One scenario was when someone cleared some queues and deleted some tables from databases in the production environment. In most cases, no one looks at why the mistake may have happened. Instead, they try to find and blame the person responsible for it. We also find the blame game going on between testers and developers as to who missed the issue, instead of trying to find the conditions that gave rise to the specific situation. This is a very important thing to think about if we want to reduce similar scenarios in the future.

Annapoorani

Organizations invest in both technical and human resources to build and maintain adaptive capacity. But the main complexity is how we respond to incidents. The reactions to such failures vary but are often some version of blame-and-train. We want our IT not to fail in such brittle ways, but we are not yet sure how to create and sustain systems to achieve that end. If there is an error, we have to find its root cause. Here, when they find an error, they train the users to deal with it instead of addressing the original cause. If there is a failure, we should identify similar situations or conditions and find ways to enhance people's ability to handle them.

Smitha

System failures are common in IT and there's a typical pattern of people getting blamed or fired. This sort of reaction is referred to as blame-and-train. There have been lots of problems with blame-and-train, and it is now seen as ineffective. I have seen failure in my own career, when one of the pilot websites that we launched was so ineffective that we pulled it back. Of course, there was an alternate system that had been in use in place of the new one. The steps taken after a failure matter a lot. The author has included a few good ones but I'm sure there's more to it.

Shiva

A nice article emphasizing how blame-and-train could be an impediment to building resilient systems. It is essential to find the root cause of any problem rather than blaming the person accountable. This may not be the case every time, and there could be instances where the employee is at fault, but analyzing the issue regardless might reveal the brittleness in the system being followed. In this particular case, blaming the employee and updating the training modules would help in the short term, but the ultimate issue here is that the IT infrastructure is so brittle that it is easy for anyone to cause serious problems. One productive reaction to this problem can be to use the 5 Whys technique to get to the root of the problem.
