So, hello, everyone. Thanks for joining us today in the Postgres devroom. Our next speaker is Chris Travers, who flew all the way from Indonesia. He's going to talk about why database teams need human factors training. Thank you.

Thank you very much for coming to this talk. This is certainly one of the database-related topics I'm most excited about, even though I'm very, very much into Postgres; this topic really excites me. Just to introduce myself a bit for those who don't know me: I have a bit over 24 years of experience with Postgres, so almost 25 years. I've built accounting software on it, I've worked as a database administrator on very large databases, built database teams, managed infrastructure, a lot of that sort of thing. So I have a wide range of experience. I have submitted several patches for Postgres, one of which has been accepted, and I'll probably be submitting more at some point in the future.

I absolutely love Postgres for its extensibility. With that, of course, comes some complexity, some difficulty in maintaining our mental models of how things are actually working. Especially if you're working at scale, it's really easy for your mental model not to match what's actually happening, and then sometimes we make mistakes and things happen.

So this talk is basically going to be about two things. The first is something our industry doesn't do very well: how we look at human error, and how we could do that better. I want to talk a little bit about how we can improve, and what benefits we can expect from some of the early steps we can take as an industry. This is very much a talk about database people. It's a talk about us. It's much less a talk about a specific technology, though a lot of the same technical approaches apply.

I want to give a few thanks, first of all to Timescale, for paying for me to come and do this. It wouldn't really be feasible for me to fly in from Indonesia without them. But I also really want to thank two of my prior employers. I want to thank Adjust, where we were actually able to bring in aviation training on these human factors. We brought in a company that did training for pilots as well as doctors, and a lot of that training was really eye-opening; it allowed us to do some things we couldn't do before. This was really a grand experiment, and it had a number of really big, tangible benefits.
And then, of course, I also want to thank Delivery Hero, where I worked after that, and where I was able to work with people and evaluate both the successes and the shortcomings of what we had done at Adjust, and further develop some of these ideas. These are important areas, and I would also say I'm involved in trying to help implement some of these things at Timescale.

So, an introduction. This is a completely rhetorical question; you don't have to raise your hand if you don't feel comfortable doing so. How many of us have been on a team where somebody has been working on a production database while they were drunk? Yes, I see. As we go through our careers, almost every single one of us will probably have that experience. And yet, how many times does it cause a major problem? Almost never. At least, I've never seen it cause a major problem.

Now, part of that may be the context in which it happens: the subject matter expert who was out partying wasn't really on call, wasn't expecting to work, and has now been pulled in through an escalation. Somebody else may be handling the wider incident strategy, where maybe alcohol would be a bigger problem. But at least in these contexts, alcohol doesn't seem to be a big factor in making things worse once something is already going on.

But let me ask another question. How many people here have seen a case where a major incident or outage happened because somebody made a mistake because that person was tired? See? So we valorize the thing that causes us problems, and we demonize something that probably does cause some problems, no doubt, and maybe the demonization helps prevent more of them. But we valorize something that causes a lot more problems. Why do we do that? How can we stop doing that and actually rethink our priorities?

Now, on one side, this is itself a good example of human error, right? We can talk about all the factors that go into that prioritization. But on the other side, it's also partly because we don't understand human error in our field. When we do a post-mortem, if somebody made a mistake, we just say, oh, human error, and that's it. I'm going to come back to that point in a few minutes.

So, drunkenness versus fatigue. If one group of people each drinks, say, a bottle of wine, and another group has their sleep disrupted so that they only sleep four hours and then get up, and a few hours later you give both groups complex tasks to perform, who's going to perform worse? Sleep deprivation causes heavier cognitive deficits.
Missing four hours of sleep is worse than four drinks. Now, obviously, there are some tasks where that's not the case, like driving a car, because alcohol also induces coordination problems. But from a pure information-processing standpoint, having only four hours of sleep is worse than drinking a bottle of wine, and the effect lasts at least through the next day. So it's totally worth thinking about that.

Now that I've talked about one aspect of human error, one thing that can induce a lot of human error, I want to give a brief history of why this field became really big in aviation. Back in the 1950s and 1960s, 80% of aircraft accidents or incidents were blamed on pilot error. Notice I didn't say human error; I said pilot error. I'm going to come back to that distinction in a moment. In fact, I think the number might have been closer to 90%. Today, incident and accident rates in airlines are well over 100 times lower than they were at that point. If you think about it, improvements in the technology of the airplanes themselves could only account for maybe 10% of that improvement. All of the rest is due to a much better understanding of the question of human error.

There has been a shift from focusing on pilot error to focusing on human error. When the aviation industry talks about human error, they don't mean "somebody made a mistake" and leave it there. They have a rich taxonomy of kinds of human error, of the causes of each of those particular types of error, and of practices to try to mitigate them. The way I would describe the difference is this: imagine you're debugging software that connects to a database, and every time there's an error in your query, or the database can't fulfill your request, it just says "something went wrong." You're not going to be able to debug that software at all; you're going to have a lot of trouble. That's essentially what we do when we say "human error." We simply say the person made a mistake, and that's usually as far as we look. The aviation industry has come up with a much richer understanding of this topic, almost a richer system of error codes that they use when they talk about these issues.

The reason is that it's a very unforgiving environment. If you make a mistake, you might or might not be able to recover, so you get a lot of training on this, and now the chance of a massive disaster is down to probably one per billion takeoffs, which is really impressive. We'd love to have that. So they've made this shift.
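To make that "error codes" analogy a bit more concrete for our own post-mortems, here is a minimal, hypothetical sketch of what a structured contributing-factor taxonomy could look like. The categories are illustrative only, loosely echoing the human-factors themes in this talk, not any established standard:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class ContributingFactor(Enum):
    """Illustrative categories only; aviation uses far richer taxonomies."""
    FATIGUE = auto()                  # sleep loss, long on-call stretches
    HIGH_WORKLOAD = auto()            # too many simultaneous tasks or alerts
    EXPECTATION_BIAS = auto()         # saw what was expected, not what was there
    CONTINUATION_BIAS = auto()        # kept following a plan that was failing
    REVERSION_UNDER_STRESS = auto()   # fell back on old habits under pressure
    POOR_AUTOMATION_FEEDBACK = auto() # the tooling hid what was actually happening
    POWER_DISTANCE = auto()           # someone saw the problem but didn't feel able to speak up

@dataclass
class PostMortemFinding:
    summary: str
    factors: list[ContributingFactor] = field(default_factory=list)

# Instead of a terminal "human error", a finding carries something debuggable:
finding = PostMortemFinding(
    summary="Failover script run against the wrong cluster at 03:10",
    factors=[ContributingFactor.FATIGUE, ContributingFactor.HIGH_WORKLOAD],
)
print(finding)
```

The exact categories matter less than the fact that "human error" stops being the end of the analysis and becomes something you can aggregate across incidents and act on.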
They've also made a shift that we've already made, and it's worth pointing it out: a shift from individual responsibility to collective responsibility. In our terms, we call that blameless culture. Somebody makes a mistake; we don't blame them. We don't go, hey, stop making mistakes. We try to find some way to keep that mistake from happening again. But because we don't have a clear understanding of this topic, we try to solve it in ways that maybe aren't as effective as they could be.

I want to give one really good example of a watershed moment. Actually, before I talk about that, let me quickly discuss David Beaty's contribution. Beaty was a cognitive psychologist and pilot in the UK, and in 1969 he wrote a seminal book called The Human Factor in Aircraft Accidents, where he basically looked at the kinds of mistakes that happen and the kinds of circumstances that lead to those mistakes. There are newer editions of that book out now. It's actually worth reading, though probably not the best choice if you're a nervous flyer. As a good description of how to break down error, it was the industry's starting point.

Ten years after that came what I think was the watershed moment for how this became really big within aviation, and that was the Tenerife disaster. The Tenerife disaster is still the deadliest aircraft accident in history. It happened on the ground in Tenerife due to a variety of factors. I'm not sure how much detail I should go into in this talk, but the end result was basically that one 747 tried to take off down a runway with limited visibility, without a proper takeoff clearance, and hit another 747 on the ground. A clear case of human error, and the Spanish report more or less blamed it on pilot error: this guy tried to take off, he didn't have the clearance, it was his fault.

The Dutch report, which is often criticized in some documentaries I've seen on this, was very, very different. What they actually did was ask: why did he try to take off without clearance? What was going through his mind? How did that mistake happen? The thing was, he was an extremely experienced pilot. He was their chief pilot. He actually didn't fly airplanes that much; he was mostly sitting in simulators. And at that time, when you were the senior pilot in the simulator, you were the one giving the clearances. So: a stressful situation, visibility is low, there's pressure to take off. He goes back to what he's used to doing, which is assuming he has the clearance, because he's used to giving it to himself. Airlines don't do that in their simulators anymore, for obvious reasons.
The Dutch report became the basis for how the aviation industry has looked at human error ever since. As a result, we've seen this massive, massive improvement in safety. Every pilot at every airline gets this sort of training, and it has made our flights much, much safer.

The question is, can we benefit from the same thing? The answer is yes, and we can actually get a lot more out of it than just reducing incidents and recovering from them. If you look at the standard definition people give of crew resource management, it's the use of all available resources to ensure the safe and efficient operation of the aircraft. If we can use all of our resources to improve both safety and efficiency, that's going to make our jobs better; we're going to be more productive, and we're going to be happier. This is a really, really important field that I think we need to improve on.

Now I'm going to talk about how we look at human error in our industry. In DevOps and SRE circles, we typically have one answer to human error, and that is automation. If somebody made a mistake, we're going to automate that away. We're just going to add more automation, and then more automation. It seems like a great idea: computers are infallible, we're fallible, so we'll just use the computers to prevent the mistake.

The problem with this is something the IEEE has published a bunch of research on, called the automation paradox: the more reliable the automation is, the fewer opportunities humans have to contribute to the overall success of the system. I'm going to take a little bit of time here to talk about why that is, and it will get reinforced in the next section when we talk about why we make mistakes.

To start with a basic summary: obviously we need automation, because there are certain kinds of procedures that we're actually very bad at following consistently, and there are certain kinds of requirements, a lot of safety considerations, where automation can really save us. Steps that have to be done together really should be automated so that they happen together. But when automation is done reflexively, at least according to a lot of the research out of the IEEE as well as many of the aviation study groups, simply throwing automation at a problem can actually make human error more common and more severe, and then, when things are out of whack, you have no chance at all of preventing a major incident.
Part of the reason is that we process everything we see through a mental model, and when we add more complexity, when we add more automation around all of this, we make it really, really hard to keep that mental model reasonably in sync with reality. Then, when something goes wrong, we can spend a lot of time and effort struggling to understand what's going on, or we may reflexively react in ways that actually make the problem worse. So automation isn't the answer; it is part of an answer. And reflexive automation, "oh, we had a problem, let's automate it away," is not the answer.

Now, I mentioned a moment ago this issue of mental models. We humans operate in a world that's very different from the way computers operate. Computers are basically systems that mathematically process inputs and produce outputs, so computer programs operate in a closed world. We humans don't operate in closed worlds; we operate in open worlds. There are things we know we don't know, and there are things we don't even know we don't know. In order to function, we have to maintain these mental models, and those mental models are necessarily a simplification of reality. So when something is going wrong, we have to dig into how we think the system works and work through that, and the more complexity we throw into our automation, the harder that process becomes.

So automation, as I say, is an important tool. I'm going to talk in a moment about good automation versus bad automation, but it's not something we can rely on to solve the human error problem. So, good automation versus bad automation. I think this is really important. What I've often seen happen is that you end up with large automated systems, whether that's something like Ansible or Rax or Kubernetes or whatever, and oftentimes there isn't a clear understanding of how these things work underneath. If people do understand all of that, and it's been built in deliberately, then a lot of this becomes much easier. Good automation is a deliberate, engineered process, rather than something thrown together in the course of the messy world of operations, and it's designed around, well, three factors, actually. The first factor is the system.
The second factor is the people, and we usually forget that one. And the last one is that we actually need to think about the human-machine interaction. Good automation takes the people into account. Good automation has built-in decision points where the person can actually sit there and say, hmm, this isn't going right, we're not going to proceed. And good automation is a well-understood process.

The other thing that's really important as we look at automation is feedback, because the more we automate, the more we typically insulate the person from the feedback of the individual steps that would otherwise be there. So it's really important to sit down and think about: what is the human going to see? How is the human going to be able to interpret it? How much feedback do we want to send? Everything we got, or some summary of it? Those are decisions that have to be made deliberately, based on the context of what we're doing as well as a clear understanding of what the failure cases of the automation are. And then people actually need to be trained on what the automation is doing under the hood so that they understand it, rather than just going, oh, push button, everything good.

The way I always look at it: a lot of people think checklists are a step towards automation. I think automation should be a step towards a checklist. The relationship should run the other way around, so that you're thinking about how you want the human to interact with this, how you want the human to perform these steps, and where you want the human to be able to say, this isn't going well, we're stopping. I'll show roughly what I mean by a decision point in a sketch in a moment. Those are the sorts of questions and designs we have to think about, especially for critical systems like databases, where if the database is down, the business may be down.

So now I want to talk a little bit about why we make mistakes. As I mentioned before, computers operate in a closed world: they get inputs from us, they do processing, they give us outputs. We live in an open world. We experience things, we perceive things, and what we perceive is not complete; our mental models aren't complete either. We make inferences based on incomplete data. And in order to function in this world, we have had to adapt and develop certain kinds of cognitive biases.
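Here is that sketch of a decision point: a minimal, entirely hypothetical example of automation designed like a checklist, where each step shows what it is about to do, surfaces the raw output instead of hiding it, and stops unless a person explicitly agrees to continue. The specific commands and the pause_traffic.sh helper are made up for illustration, not taken from the talk.

```python
import subprocess
import sys

def run_step(description: str, command: list[str], expected_hint: str) -> None:
    """Run one checklist step, show its raw output, and let the operator decide whether to go on."""
    print(f"\n== {description}")
    print(f"   command:  {' '.join(command)}")
    print(f"   expected: {expected_hint}")
    result = subprocess.run(command, capture_output=True, text=True)
    # Feedback is deliberately not hidden: show what actually happened.
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    if input("Does this look right? Continue? [y/N] ").strip().lower() != "y":
        sys.exit("Stopping here: the operator chose not to proceed.")

# Hypothetical steps; real ones would come from your own runbooks.
run_step(
    "Check replication lag on the standby before touching anything",
    ["psql", "-At", "-c", "SELECT now() - pg_last_xact_replay_timestamp();"],
    "a lag of no more than a few seconds",
)
run_step(
    "Pause application traffic to the primary",
    ["./pause_traffic.sh"],  # hypothetical helper script
    "exits 0 and reports that traffic is paused",
)
```

The design choice this tries to illustrate is exactly the one just described: the human stays the decision maker, and the automation exists to help them see, not to replace them.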
Now, a lot of times people look at cognitive biases and go, oh, it's not good to be biased; bias is a bad word; we don't like biases. But the fact of the matter is that if you could get rid of all of your cognitive biases, you would be unable to function. Confirmation bias is one we tend to be aware of. But here's another one: continuation bias. Continuation bias is the tendency to continue following a plan you've put in motion, even when you're starting to get good indications that it's not a good idea. If you didn't have continuation bias, you might have to sit down and rethink your plan continuously, over and over again, and that wouldn't be very helpful. So continuation bias, just like confirmation bias, actually helps us function in the real world. The problem is, it can also lead us into situations where we do the wrong thing. Understanding these biases and their implications is a very important step towards being able to notice when they're causing problems and to trap those problems. So rather than trying to eliminate our biases, which is how I typically see people trying to approach this, it's better to think about what kinds of problems the biases can cause and how we can detect and trap those problems.

And there are a large number of these biases. Expectation bias, for example, is related to confirmation bias: it's the tendency to filter out perceptions that don't match your expectations. This happens today in a lot of environments. It happens in our industry, and it obviously still happens in aviation, fortunately usually without serious consequences. The most common problem it causes there is that the plane comes up to the gate, the pilot says "disarm doors and cross-check," somebody misses the door that's going to be opened, the other person cross-checks, expectation bias kicks in, and they don't notice that the door is still armed. They go to open the door and guess what happens: the emergency slide deploys. It doesn't harm anybody on the airplane, but it's going to make a bunch of people unhappy, because the airplane's next leg is going to get canceled. And that's usually the worst that happens.

But these are important things, and we have to recognize that these sorts of biases are going to happen, and that our ability to maintain situation awareness in the face of them is very much tied to how aware we are of the kinds of problems they can cause. Because we form this mental model, we're going to interpret things according to that mental model, we're going to continue our existing plans, and so on.
And when somebody says, hey, wait, maybe this isn't right, that's suddenly an opportunity to go: my biases may be leading me astray; let's sit down, figure out what's going on, and verify. Human factors training actually tends to include exercises aimed specifically at doing that.

So, the second major issue is reversion to prior behavior under stress. This happens to all of us: when we're under stress, our focus narrows, we start filtering things out, and we start falling back on habit. What this means in a database team is that when there's an outage, if we're not careful, we will revert to the things we're used to doing, even if we have decided they're not the best way forward. I've watched cases where a company has been really trying to move towards a more collaborative approach to incidents, and then when an incident happens, people get stressed out and go back to that hyper-individualistic cowboy incident response. A lot of that is simply due to stress; it's a very well documented part of the stress response. One thing we got at Adjust from the human factors training was a strong understanding of that problem, as well as a good understanding of how to measure stress so that we could actually keep an eye on it.

Another major thing that causes problems, and I've alluded to this before, is fatigue. How often do we see people who have a rough on-call night come back in the next day and start working on stuff? How often are we willing to say to that person: no, go home, get some rest, I don't want you working on this right now? How often have we seen people who are on call for an extended period, with a rough shift, make mistakes after several days of continuous sleep interruptions? Do we ever think about whether, when this happens, we should be switching these people out more frequently?

In the airlines, before any flight happens, the flight crew get together and check how each other are doing. And there is an expectation that there is a standby crew, so that if you're not feeling your best, you can say, hey, I didn't sleep well last night, I don't want to fly. That's another thing that has really helped improve safety, and something we should probably think about doing: you're getting tired from the on-call, time to switch you out. Do we do that? I have never worked anywhere that did.

So, a final major point on how and why we make mistakes has to do with a term from human factors lingo called workload.
Now, I don't like this term in this context, because when we say workload here, everybody thinks, oh, I have so many things I need to get done this month. But on the human factors side, workload doesn't mean over the next month or the next week, although planning that can be helpful. What it really means is: how many tasks are you having to pay attention to right now? How many people here can actually listen to and understand two conversations at the same time? Nobody? Maybe it's possible for some people to train that. But there are certain kinds of things our brains can't parallelize very well, and it matters to understand where those boundaries are, and what it costs to switch and flip between tasks. How much can we reduce that in-the-moment workload?

That's actually really important, because one of the things I've seen happen is this: you have your standard runbook, and the way most people write their runbooks is step, explanation, discussion of output, next step. What happens at three in the morning, if you've never done this particular process before, is: run the step. Okay, it did what I expected. Now, where's the next step? It becomes really, really easy to miss the next step in your checklist, or to miss critical details that are obscured by the fact that you're now having to read through paragraphs at three in the morning while troubleshooting a broken system.

One of the things I did while I was at Adjust was to start writing some of what I'd call our unusual-procedure checklists, non-normal procedure checklists: the things you do when something goes wrong, the things you might have to do at three in the morning without having done them at all in the previous three months. And what I ended up doing, and this is a good opportunity to talk about some of the main benefits of this sort of training, was the following format. It's a bullet point. Here's what you can ideally copy and paste into the terminal. Expected output, warning signs, all as sub-bullets, and then back, unindented again, to the next bullet point. So they're hierarchical and easy to scan, and the main points are all really short. All of the longer description that would have been in those paragraphs gets moved into footnotes, and those are all hyperlinked. So you run a step, something doesn't look quite right, you want to see the longer description, you click the hyperlink, you come down to the footnote, you read the whole thing, and you decide whether you want to proceed or not.
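As an illustration of that layout, here is a minimal sketch of what one entry of such a non-normal checklist might look like in Markdown. The scenario, queries, and footnotes are hypothetical, not the actual Adjust checklists:

```markdown
## Error spike on the ingest database (non-normal procedure)

- **Confirm the spike is real**
  - `SELECT count(*) FROM app_errors WHERE created_at > now() - interval '5 minutes';`
  - Expected: clearly above the normal baseline
  - Warning sign: count looks normal but alerts keep firing, see [^1]
- **Find the dominant query**
  - `SELECT query, calls FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;`
  - Expected: one or two queries clearly dominate
  - Warning sign: nothing dominates, stop and escalate, see [^2]

[^1]: Longer discussion of alert noise versus real error spikes lives down here,
      out of the way of the 3 a.m. reader.
[^2]: Full background on when to escalate, and to whom.
```

Everything you have to read while executing is one short line; everything you might want to read lives behind a link.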
What this allowed us to do was take the standard platform team people who were on call and actually have them do error-spike maintenance at three in the morning on, as I say, the super-critical, high-speed database system. Before that, every time there was an error-spike issue, it was an automatic escalation, because we didn't trust that they would be able to do it or make proper decisions around it. But once we formalized it into checklists, offered some training on them, and made sure people understood the overall considerations of the processes, they could do the basic stuff and then call us if there were questions that weren't obviously answered by the documentation. That was a very good, tangible benefit: instead of several people waking up in the middle of the night, these things could be handled by the on-call engineer. It's a really good example of the benefit that comes from paying attention to that workload issue and to the sensory overload that is much more serious at three in the morning than at three in the afternoon.

At this point, it's really important to recognize that we're no longer talking about human error as "somebody made a mistake." Instead, we're talking about the need to be able to debug the person and why they made the mistake. That's something we very often don't even try to do in our industry, but we should. It requires that we have a really good taxonomy of types of mistakes, so that we can say: situation awareness lapsed because of sensory overload from too many monitoring alerts going off. That's a very common one in our industry; it's also something that has caused problems on airplanes. If we understand that, we know they lost their situation awareness, they couldn't understand where the problem was, and this happened because they had too many alerts they were trying to focus on. Now the question is: are we actually throwing too many alerts at people? Do we need to prioritize things differently? Do we need to rethink how we do alerting? Suddenly we have a dimension for looking at these problems that we currently don't have. Instead, what currently happens most places I've worked is: okay, something went wrong, we didn't spot it, therefore let's add another alert on top.

When I was at Delivery Hero, we actually had a major incident where somebody, again, missed a problem relating to a database, a Postgres instance I believe, if I remember right, despite the fact that it was well alerted.
I was talking to somebody afterwards and he said, do you know what the false-positive rate of our alerts is? I said no. It was something like 99.8%. How do you expect somebody to spot the problem when, almost all of the time, our alerts don't mean there's a real problem? Now, what he meant by false positive isn't quite what I would mean by it: there were problems the alerts were alerting about, but they weren't customer-facing problems.

So the second thing we need is a really good understanding of our cognitive biases, the functions they serve for us, and the problems they can lead us into. One of the good examples is: hey, look, I know you're about to do this; I'm not sure that's what the problem is; can we think about this first? As soon as somebody says that, they're saying: my mental model is not the same as your mental model, one of us is wrong, and we should probably figure that out before we proceed. Figuring out how to do that is really important, especially when we talk about the social factors involved. It's one thing to do that with a peer when you're on an incident call and there are two of you there. It's something very different when the person typing the commands is very senior and you're very junior, and somebody C-level is popping into the call to ask for an update. I've been there, I've done that, and yes, there have been times when I did not raise the issue and I should have. Figuring out how to make these sorts of interventions, how to understand an intervention, and how to respond to one, those are things we actually need training on. We also need training on the social factors. We need to understand how power distance affects all of this. What happens when there's a C-level person in the call? How does that change your social interactions? How does that change your interactions in terms of debugging? Those are important things, and that's one area where we can get some really big improvements.

Finally, it's really important for us to get to the point where we can contextualize the person. In other words, since we humans operate in a relatively heuristic manner, we need to understand the situation the human was in when the mistake happened, and that's another thing these sorts of trainings can help with. So, I've talked a little bit about social factors here.
Power distance is what it sounds like: how big the difference is between the most powerful person in the interaction and the least powerful person. We want it to be, not necessarily equal, but much closer than it often is, and we need to figure out how to structure things so that power distance doesn't cause problems. That also means giving people good training on how to intervene when they see somebody much more senior about to make a mistake, in a way that is not threatening, and that, in the event that somebody even more senior is on the call, isn't going to be perceived as humiliating. Having good training on how to communicate in those cases is really, really important. A lot of this ends up playing out in trying to create a working relationship within the team that is heavily mutually supportive and that helps prevent, or check and trap, the kinds of mistakes each of us can make.

So let's talk a little bit about the ideal role of humans in database operations. We need to understand this well. Humans need to be in control. We need to be the decision makers. We need to be the people who can say: this is what I think is going on, let's go ahead and try this process; and then, halfway through, say: this is not going well, let's back off, rethink, and make another decision. Partly because we operate heuristically, we can do things that computers can't. We need to maintain really good situation awareness, which means we need transparency in our complex automation; we need the automation to be built around helping us, not replacing us. And to do this, we need to be well rested. We ideally need to be at clear, peak capability when we're in the middle of an incident. We may not be able to completely manage that last part, but if we can take steps towards it, we can do better.

So a lot of this training, at least what I've gotten out of it, is really important. And I'll just talk quickly about how to go about doing it: if we can get organizational leverage behind the training, then we can actually turn the promise of the training into reality. You can't just teach people something once and then abandon it; that doesn't work.

So, as an industry, we treat human error the way pilot error was treated in the 1950s. We have a whole lot to learn from aviation.
Those lessons are already being played out in medicine and many other fields today, and we need to do what we can to learn from them too. It's really important to recognize that we can get really good improvements in reliability, efficiency, speed of development, all of these things, if we can work better with the human side. And I'm not talking about managers rating performance; I'm talking about people on the team understanding human performance, for themselves and for others. The three pieces of this are: getting trainers in who have experience; an organizational commitment to make it happen; and then internally building your own programs, your own recurring training, and your own training for new people, so that internally you have a strong culture around it and you have experts who can think about it when it comes time to do a post-mortem. So that's what I have. Any questions?

Audience: Thank you, that was an amazing talk. Do you have any recommendations for further reading if you can't bring in experts?

So, this is a field where aviation has a massive textbook industry. I think probably the most accessible book to start with is one of the more recent editions of David Beaty's The Human Factor in Aircraft Accidents. I think the most recent edition is called The Naked Pilot: The Human Factor in Aircraft Accidents, the title referring to exposing the inner workings of the human piece of the aircraft. But again, if you're a nervous flyer, probably look for a crew resource management textbook instead. It may be less nerve-wracking and less intimidating, but the information will be there too.

Audience: Do you have any recommendations for testing or drilling your processes, like those checklists?

Yes, I do. This is one thing that I think we as an industry really should figure out how to do, and I completely believe in it. Obviously the chaos monkey idea from Netflix could be exploited for this, and you can also build war games around it. The thing is, it's really important to have drills, which means you often have to simulate or create some sort of potential incident that the team has to come together and resolve. Ideally, you need to figure out how to do this without threatening your customer-facing services; in some cases the cloud is a really good opportunity for that. But having those sorts of drills, maybe once a quarter or twice a year, can really give you an opportunity to spot problems, figure out improvements, and actually go do something about them.
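As a small, hypothetical sketch of what a lightweight drill like that could look like, here is a facilitator script that picks a scenario, lets someone inject the fault into a staging environment by hand, and records the time to resolution. The scenarios and roles are made up for illustration:

```python
import datetime as dt
import random

# Hypothetical game-day scenarios; these should only ever touch staging or
# scratch environments, never customer-facing services.
SCENARIOS = [
    "Replication is paused on the staging standby and lag keeps growing",
    "The WAL volume on the staging primary is nearly full",
    "A runaway batch job has exhausted the connection pool",
]

def run_drill(participants: list[str]) -> None:
    scenario = random.choice(SCENARIOS)
    print("Game day participants:", ", ".join(participants))
    print("Scenario:", scenario)
    input("Facilitator: inject the fault in staging, then press Enter to start the clock... ")
    started = dt.datetime.now()
    input("Team: press Enter when you believe the incident is resolved... ")
    print("Time to resolution:", dt.datetime.now() - started)
    print("Debrief: what did the checklists miss, and what surprised us?")

run_drill(["on-call engineer", "facilitator", "observer"])
```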
Audience: Just building on that last point: how do you justify the expense in time or money? If this is successful, then nothing goes wrong, so the outcome of success can look like spending a lot of effort on apparently doing nothing. I don't believe that, but it's a reasonable thing that gets asked. How do you go about justifying the time or the money on this once it's successful?

What I've usually done in the past is make my points: yes, we're going to improve our incident response, this will reduce our time to recovery, it will improve our reliability, and so on, and maybe it will improve our throughput organizationally. But then, usually, people don't listen. And then, usually, there are more incidents, and you can come in and say: these are specific problems we had here where this training would have helped. I usually find that after two or three of those, people start listening and go, oh, really? Maybe there is something to this.

Thank you.