Hello everybody. Can you hear me? Is it working? Awesome. This is my first talk, so I'm just
going to do this because hello, FOSTA! Yes! Great to hear some energy in here at this
time. Right, so my name's James and my major client currently is a, so my major client,
my only client currently is a major European airline. Get that right? And I wanted to talk
to you today about some of the challenges that we're facing in introducing observability
to that client, a framework that I've kind of put together to overcome those challenges
and some thoughts that I have overall about observability. This talk should be applicable
to any big organization. So there's not really anything that's specific to an airline, but
if you think about the scale of not only the size, but the amount of different tasks
an airline would be doing and the kind of vintage of most major airlines, you'll kind
of get an idea of what we're talking about here. By the way, just as an idea, who here
works for like a company that's got more than a thousand people in it? Okay, fair enough.
Okay. And how many of those people are actually using observability on any scale? Okay, some
of you, awesome. You should be doing this instead. In this talk, I want to walk you
through three steps I'm taking to introduce observability. One, I'm calling an observability
transformation. We're not going to be talking about anything too technically exciting here,
and we're certainly not going to be talking about anything like introducing observability
to like the cockpit or anything like that. This talk is about helping you get your company
or client or whoever else on board with observability. It's about making that transition successful
and it's about making it sustainable. And of course, the associated love and adoration
of your peers for making their lives a whole hell of a lot easier. So, first thing I want
us to do is align us on what observability is. So, that'll be easy. Does anyone want
to, I tell you what, we're running late, so I'll just tell you what observability is.
Firstly, I think that what we've got to remember when we're talking about observability is
that a lot of people don't really know what to think of, but they're probably thinking
of something like this, like a big ten-foot view of everything that's going on. Obviously
most people in here won't think that that's observability. Why not? Can anyone say, like,
is this observable? Is this an observable system by our definition? This is what I think
of when I think of observability. And when I speak to anybody that, you know, may be
lay technical or non-technical, this is the kind of thing that I'll introduce to them.
I know that I'm putting a definition on something that, you know, and that's a little bit controversial,
but this is what I think of. So, this will help you kind of ground in what this talk
is about. So, you can imagine, like, as we went through that previous thing, like, there's
this cake being made. And so, you know, I can describe quite easily that previously,
with a, like, monitoring process, we would monitor, get the metrics and the logs from
each individual component of that system. But now what we're going to do is we're going
to collect the request for a cake through that system. And this has some clear value
if we start talking about this. There's this other way of talking about it that's like,
you know, observability is how we understand something by the internal something, I can't
say, but it doesn't really kind of get across the value to people that may be a bit skeptical
about this. And I think that this kind of does. So, let's just pocket that idea for
a second. This idea basically describes observability as recording work done to satisfy a request.
So, a request is completely observable when you can see all the work done to that request,
and a system is completely observable when you can see all the work done to all requests
moving through a system. This to me is much more tangible. It does tie it specifically
to requests or events. However, I do note that when we talk about observability and
making long running processes observable, most people try and arbitrarily or otherwise
find ways of cutting them up into individual traces anyway. So, I think that this is fairly
close to, like, how we're doing observability in practice. So, in my view, an observability
transformation fits alongside other transformations which, when done right, leads to much more
productive organizations. So, with Agile, we move from waterfall to more incremental
development. With DevOps, DevSecOps, all of that, we move from silos to more cross-functional
teams with cloud, like it or load it. We move from buying things up front and hoping that
they were the right things to buying things on tap as and when we needed it. So, with
observability, we're really talking about moving everything 90 degrees. So, instead
of observing individual systems, we're going to observe requests as they go through them.
This should also act as a warning. Just, who has gone through an Agile transformation?
Keep your hand up if you think that that went really, really well. Yeah. And I'm using this
word very, very specifically because this is another thing I want us to pocket as we're
going through this. You do need to think of this as a transformation and you need to think
about the kind of pitfalls of other types of transformations and how to overcome them
if you want to introduce observability to your company, client, whoever. Okay. So, we're
all aligned, please, on what observability is. We know we want it, but we don't get to
decide. So, we need to think about who we need to convince. Although you could probably
get away, especially in smaller or more agile companies, we're just convincing a couple
of people and going ahead, often with this sort of thing, you're going to have to convince
a lot of people. And so, this is me capturing three broad groups of stakeholders here that
you're going to want to convince if you want to bring people along with this observability
transformation. And you want to get everyone on board because if you only get, for example,
the C-suite on board, like the higher ups, if you like, on board, then engineers will
just make your product fail, make your transformation fail so that they can get back to their work,
like with any other thing. And then management will just say, right, I've just lost a load
of productivity, we can solve this by getting rid of this observability thing. Similarly,
if you get your engineers on board and they keep pushing towards it, you'll land up with
them being burnt out because they're not being given the time and the resources that
they need to actually make it work. So, it's worth thinking through very quickly here,
wary of time. I can spend ages on this slide, by the way, because thinking about stakeholders
is really, really interesting, but I'm just going to pick up a few highlights. As an example,
anyone here a skeptic would describe themselves as an observability skeptic? I'd imagine,
maybe, do you have any reasoning? No, that's fair enough. But it's worth noting that even
in here, and I think that there's lots of people outside, the thing I compare it to is
kind of transforming towards test-driven development. A lot of places will introduce test-driven
development and the way that they'll do it, for example, is their experience will be that
some manager somewhere will insist on 100% test coverage. So, they've gone through that,
they have to do all these ridiculous things to jump through hoops to get this transformation
to be complete, and then they come out at the end of it saying, well, test-driven development's
crap, we're not doing this. They managed to get rid of it and they managed to dump it.
So, you might think that of these three people, the engineers would be the easiest to convince,
but there are lots of people that are out there that have gone through three or four of these
now and really need to be sold on whether this is going to help them. So, really, don't think
that they're going to be automatically on your side just because you're convinced. Also,
I'll note all the disagreement that we have just in this one conference about what the best
tooling and the best approaches are anyway. Quickly on things like management, management
will want to be convinced that it's not going to break down productivity. One example I'll
give when we're looking at, for example, higher ups like the C-suite, they're going to be
interested, you're going to be asking them to spend money because you can't just say,
oh, we're going to do this, you want to actually resource a team. With my client, what we did
was we actually went through the outages over the last 12 months and we did some estimates,
we said there are estimates and we caveated like what the caveats are. We went through and
we worked out how much time we think would be saved on outages, on each of these outages,
if they had good instrumentation of their code and if we could identify the issues more quickly.
They could go away and they could calculate that in terms of a cost which they could use to justify
it. So, don't forget about your stakeholders. One thing that you didn't hear is in all of that,
is what tool to use. That's because, sorry, everyone that makes a tool, it largely doesn't
matter at this point. People want traces because they want less downtime, they want more clarity,
they want to capture lost revenue or whatever else. But you can do that with pretty much
any observability tool right now. So, the one thing you don't want to do as part of convincing
people is to try and sell them on a specific tool. That can come later. In my engagement,
we're focusing on tempo. And the reason that we're doing that, I'll introduce some of the
other reasons in a bit, but the main reason is because we always use Grafana, we already
use Prometheus, it slots right in. And we don't really have to discuss it much. There's another
thing which is because tempo is open source, we don't have to involve as part of selling this
project, a new vendor, and new commercials and stuff like that. So, open source to the
rescue with that. But really, you want to get your project approved so you can go and
start instrumenting code. Last thing I'll say on this is team topology. This is an example
of the sort of team that I'd expect to go and start an observability transformation.
You'd want, I prefer smaller, more agile teams. So, you might look at this and think, well,
based on my business, I might have two or three of your software engineers, two or three
of your operations engineers. That might be an anti-pattern. You can go and look up all
the reasons why bigger teams tend to do work more slowly. I'm not going to cover that now.
So, I'm looking at a kind of crack team. Software engineers are going to get in and go and instrument
the code. We've got an operations engineer that's going to make sure that we clear the
pathways to actually get those spans out into tracing databases. And finally, we've got
somebody that's kind of in a product owner position that's going to protect that team,
make sure that they're not answering inane questions all the time. And that is also going
to be working with the business and with the other product delivery teams and the platform
team and whoever else to make sure that concerns are raised, that they're heard, that they pivot
when they realize that actually they've made a mistake. So, that's an important role as
well. But remember, this is a transformation and we're trying to do new things. So, we're
changing cultures here. So, you do need to be responsive to feedback and you need to
be responsive to feelings. Otherwise, your engineers here are going to make the best
system that never gets used, which is another pitfall of transformations.
Okay. Those are my thoughts on convincing people to do an observability transformation.
Now, let's imagine you've got the thumbs up. Let's move towards implementation. Most important
thing is to not get bogged down in the details of the infrastructure. You need to move to
instrumentation. But, you are going to have to need some sort of tracing database. You are going to
need some sort of tooling. If you have something already, so for example, if you're already using a
provider of some sort and they have it, then great. Consider that. However, one of the ways that you can
make sure that these things move faster is by moving your tracing database into where the data is that
you're collecting. You think about big, old companies, big and or old companies. They get really
nervous when you say, right, we're going to collect all this data and we're going to go put in this cloud provider over here.
Now, that can take months to agree. And so what you can do is you can short circuit that, start that process,
start discussing how you're going to do this. But you can also at the same time move your tracing databases into
maybe the accounts or the cloud provider that's actually already been agreed to use this. There is a downside
here, which some of you might be thinking is, well, doesn't that mean, James, that you'd have maybe multiple
tracing databases, which means that you wouldn't have all your spans in the same place? And that is true, but it
means that you can move on to instrumentation. It means that you can move to the point where you have like maybe two
traces that somebody has to look to, and then you can get other people in the business to say, hey, wouldn't it be useful if,
and then you can start having the discussions. Don't try and boil the ocean on these things. And we're being
pragmatic here. So as an example here, this is if your client is in AWS, you can quickly get Tempo. There's a good article on
the Grafana website deploying Tempo on Fargate, which means that you can get that up nice and quickly. So again, that's an
advantage of using these things. And more importantly, you can deploy it. You can find out it's the wrong thing to do,
and you can go do something else. And it's a great thing about using these open source tools is you can really work it out as
you're moving. With that in mind, get instrumenting. And know that to start with that team that I put together earlier is going to be
doing a lot of the work themselves. Automatic instrumentation is your friend. Get your software engineer to go and find the code
bases that are across the system, especially on your hotpaths, and start raising PRs to auto instrument. You know how best to do
these in your company. Some companies, they want to start the conversation with a PR. Sometimes they want to start with a meeting
or something like this. But getting auto instrumentation in to these code bases will mean that you will start being able to build up the
shallow layer of these traces. Then if any teams start becoming interested in this, opportunistically pair your software engineer with
those teams. Pairing mobbing is a great way to share knowledge. Remember, a lot of these software engineers will not have done this kind of
thing before, and doing it's kind of hard if you don't know how to do it. You don't want them to get frustrated to throw in the towel
and say, no, this is dumb. This is hard. This is not the way we used to do things. Whereas if you can put your software engineer in with
a pair as a pairing or a mobbing situation, they will have happy times and everything will be lovely. Also, make sure that you point out the
value when you see it. It's very easy for us to see these things and to go, oh, it's great. And so obviously it's great. But this is new to
people. So point out the 10% of their queries that has like this weird choke point. Point out all these advantages you're getting from this
instrumentation and from all these spans as you're collecting them. Find, when there's an issue, when there's downtime, get your team to go and
see if they can race the people that are doing incident response to finding where the issue is based on the tracing. Once these teams realize that they can see
through walls with this stuff, they'll soon start instrumenting their own code. But you need to get them to look. Another trap is to get bogged down on
the problems that are harder to instrument. Airlines and banks and other places have a bad habit and that bad habit is Fortran. Or like Zidark or some
mainframe thing or whatever. If anyone here, has anyone here just put your hand up if you do any development on like COBOL, Fortran,
anything like that? Awesome, awesome. If you go an instrument something like that, please come and talk about it. That sounds awesome. That sounds like a
lot of people are going to talk about this one. I'll be fascinated by it. But if you're doing this kind of project, now is not the time. Something like that is not
really going to correct me if I'm wrong, anybody out there. I don't think there's any instrumentation for Fortran code or anything like that. Treat it as a third
party system. And also don't try and instrument other people's code. I've seen this happen. People will go, right, okay, we've got this third party and it's
this third party code that we deploy. How are we going to instrument that? Do not instrument the stuff that is there and then accept that you're going to get to a point where it's going to roll over to logs and
metrics. If the tool that you're using allows you to be able to connect up logs and metrics to your traces, that's really handy because remember in these big organizations, you might never reach the golden
sunlit uplands of traces for everything. So you're always going to have to go back. You can think of it sort of like fast travelling through the infrastructure as that you are not going to be able to get to the point where
necessarily you're going to be able to get into the point in the Oracle database that you're really trying to kill that has actually had this problem. But you will be able to fast travel to the bit in the code where it makes a query to an
Oracle database and then you'll know which logs and traces to look at. So the goal really is for wide coverage, especially of hot paths. And that brings us to another thing which is culture change. So you've been working on this for maybe
six months or so. It's fairly short projects. You've gotten traces. You've got end to end on many of the request pass through the systems. People kind of get observability now. So those three people will come out with a few others and build an observability
engineering team, right? I would say that for most organizations, that's the wrong way to do things. There are companies for which observability engineering having separate teams stood up for that kind of thing does make
sense. But for most places, you're really going to be looking at creating this kind of. This is one of my favorite slides ever, which is weird. I have weird favorite things. But this talks about like a DevOps transformation where what you do is you create a DevOps team and the best DevOps team disappears after like six or
twelve months because what it's done is it's created this thing where they come together. And you should, you know, this is a valid way of doing things for observability as well. Ultimately, you may have an enablement team. However, instrumentation should be being done by devs as part of their day to day work
The tooling needs to, oh five minutes. Oh, slow down. Enablement should be sharing best practices and doing training and stuff. The tooling really needs to be absorbed into an existing platform team. And this is the really cool part is that now, if you think about it, you've gotten to this point where you've got all this instrumentation into your codes. You can start thinking about what kind of tooling makes sense for your organization. Whereas when you started, it's very hard to do.
That wasn't five minutes. Okay, I'll stop. So, yeah, if you've done your job well, hopefully these people won't need you anymore. And you can go and absorb back into teams and you can call that project complete. You might be able to do some kind of enablement team. But as I said, that wasn't meant to have a question mark at the end of it.
Go and effect change. I'm going to end it there. There wasn't much time. I've got so much more I want to talk about on this subject. So I might do a follow up thing. If you have any questions, I'd be happy to answer them and you can find where to find me at that website.
Thanks, James. We have still five minutes for Q&A. Some questions here.
Okay.
There is one.
I answered almost everything.
Hi. How long has it taken for you to convince a big org and an old company to move from no observability to some sort of observability?
Completely convinced. I'll let you know.
So I maybe joined back in May with this client and was helping them with a previous project that was getting wrapped up.
So I'd say it's been eight or nine months working on other things and identifying this as a need where it's been working through.
Yeah, it can take time.
So because you've got all, as I said, you go back to that stakeholder slide.
I could have spent a whole 20 minutes just on that because you've got to kind of get everybody aligned.
I've done lots of like meetings. I've shown the people off, shown things off to people.
I've shown off all these slides and stuff, gotten everybody on board.
And yeah, so I think by the time, you know, I'd say that everyone's actually in lockstep, probably about now actually.
I should say though, by the way, is we didn't, you know, just not do anything until that point.
So there's been lots of opportunities to like seed things as we've been doing other work as well.
So yeah.
All right. Some questions? No, then thanks, James.
Thank you.
Thank you.