Great. So good afternoon everyone. Next we will have Donnie Burkaltz introducing platform
engineering for dummies. Thank you. Super excited to be here today. It's been a number
of years for many of us since being at a Fosdham in person, so welcome back. I was very happy
to be here. I got myself a very nice Belgian beer as soon as I arrived, so I'm feeling great
right now, all ready for my talk. Only one now, just one. The rest will be later. And
I hope I'm assuming none of you are actually dummies, so thank you for coming to this talk.
This is just for people who have heard the term platform engineering. It's starting to
get increasingly popular. It's the only thing people talk about besides AI these days. We're
going to mostly skip that one. And we're going to talk about what it is, how vendors are
completely destroying the term, just like they do with everything. And then how to get
started with it yourself. How you really make it as easy as possible. You don't have to
buy vendor solutions. You can use open source off the shelf software. It doesn't even have
to be custom and brand new. So by the end of this talk, you'll have a really good sense
of platform engineering, at least as good and as deep as you can get over the course
of the next 12 or 13 minutes. You'll have a lot of good resources. I've got links in
here and a couple of the slides as well. So you can go check those links out afterwards
because it's not just about technology. It's also about the people. It's also about the
process. There's a lot of different pieces you have to do to get this right. In fact,
the technology in many cases is the easy part. But first, a very short story. A few years
ago I worked as a technology leader leading a DevOps transformation at the time. That's
what we called it. We now probably call it platform engineering at this travel tech company
called Carlson Vaganley Travel, CWT. It was actually an office here in Brussels. I visited
there a few years back. Great place. Lots of interesting development happening there.
Since then, I actually have led products, management, and products at Docker and at
Precona around open source containers and databases. I've spent a long time in the platform space.
Long story short, I know what I'm talking about. I've been doing platforms for like 20 plus years
at this point, as have many of you. I'm just sharing my own story and my own perspective here.
I'm sure many of you have your own. When we think about platform engineering, or at least the way
I look at it, there's really three key pillars to it. There's platform operations, platform as
product, and self-service for developers. We're going to jump into each one of those pillars
and talk a little bit more about what that means. If you want to check this out afterwards,
I have my own independent analyst from my little blog post about it. Feel free to check that out
at your leisure. What does platform operations mean? There's a lot of companies today. In fact,
how many of you come from a large enterprise? Do you have something called a platform team?
Does it maintain maybe Linux OS, maybe some other OSes that we won't talk about,
some things like that? It just got called the platform team at some point. It might have been
the OS team. Before that, maybe they merged it in with the network team or something else like
that. When we talk about platform operations, we really mean operating it as a holistic platform
regardless of how many servers, how many VMs, how many containers might be underneath it.
The same thing we talked about 10 years ago with Cloud, the same thing we talked about five years
ago with DevOps, moving away from that Pets mindset into the Cattle mindset, moving away from that
single server, single container, naming things after our favorite characters or our favorite TV
shows into that mindset of these things are fungible, they're disposable, we operate them as
applications and fleets of things and they're automatically created and deleted on demand.
We're in this world of SRE now, we're moving more and more into things like SLOs of how do you
monitor the user impact of the applications you're serving. In this case, we're talking about
platform engineering, meaning building for developers, but even if you're serving internal
developers, a platform, you still have to care about the quality of service that you're giving them.
You still have to care about your latency, you still have to care about your error rate,
you still have to care about how much of your capacity you're using in any given moment.
You have to treat those internal applications just as importantly as you treat the ones that
you're serving to your external customers and users. A lot of companies don't do that, they'll
have things like their tier one applications, those are business facing, they get major incidents,
spinning up war rooms and all that kind of thing when there's an outage, but if their CI pipeline
goes down they say, oh well, it'll be back eventually, it'll be fine, we can just have our
developers kind of doing nothing for most of a day, no big deal. A lot of companies are still like
that, but we have to apply this platform operations concept not just to our external customer facing
applications, but treat our developer productivity as something business critical in its own right,
because developers are expensive. Sitting there for a day, not being able to ship software is expensive.
And so we went through exactly this journey at CWT. One good example of this was we started by
monitoring tens of thousands of different infrastructure metrics, right, classic old school
world of monitoring, and we shifted that into just a handful of user facing impact metrics,
but along the way we actually had to educate our developers and our operations teams on how to
debug things in a much more complicated way than they were used to, because with the infrastructure
metric you could have a simple runbook. You see this thing, you push this button, done, whereas if
you have a metric saying my application is slow, there's a lot more potential causes, a lot more
you have to learn to jump into it, and so at the same time we made this transition with technology,
we also had to upskill a lot of our level two operations teams and had them become an SREs
in their own right learning how to automate things, learning how to debug things much more deeply.
Now the second piece is platform as product, and when I say this what I mean is
for things like your internal CI pipelines, for things like your container services,
whatever other internal developer tools and services you might have, you have to apply
the methods of product management to them. You don't have to have a full-time product manager,
that's fine, if you do fantastic and you're lucky and fortunate and congratulations on that,
but if you don't, there's a lot of different people who can pick up some of that load,
learn how to do modern digital product management, right, you might have people even
depending on how traditional your company is called service managers, right, they might use a
framework called ITIL to talk about things, and those people still have the potential to modernize
and move forward and get with the times and apply modern product management approaches, meaning
talk to your internal stakeholders, understand the problems they're trying to solve, right,
in many cases they might be providing a service like source code management is a service you
provide to your developers, there's probably a team running it inside your company if you're at a
big company, do those people talk to their own developers about what problems they're trying
to solve and what their workflows look like, Jets are probably not, they just shove stuff at them
and say good luck, right, and we're fortunate we now have better tools than we used to,
but there's a lot of opportunity for people in those positions of being these central platform
teams or central developer productivity teams to go talk to their own developers about the
problems they're trying to solve their day, understanding their pain points, and bringing
that back in. At the bottom I've shared a handful of links in varying levels of depth that are super
good resources if you're wanting to learn this or if you wanted to share these with other teams,
there's an entire specialization on Coursera that'll probably take somebody six months to go through
maybe an hour or a few hours a week, there's a great book by the same person who put together that
course or the series of courses and then there's a website you can just go read for free to start
checking it out right now. In every one of those cases they aren't written for Platform as Product
People, they aren't written only for internal product management, they're written for anybody
doing modern product management of how do you get that up to speed and so you have to do a little
bit of extra work to think what does this mean for me specifically, but all of you are smart people
you can figure that out. Applying this Platform as Product approach is absolutely critical to doing
Platform Engineering right and nothing about this requires a specific piece of technology,
nothing about this says proprietary versus open source, this is the people and the process side
of it, but you have to get this to get Platform Engineering right because if all you do is say
oh hey we gave you a platform now we've got Platform Engineering, you're wrong. What probably
happened especially if you're at a big enterprise is you still have a ticketing system somewhere
and you're still requiring developers to go file a ticket every time they want access to some new
resource whereas if you're getting Platform Engineering right you're moving away from
that because you talk to your developers, you've understood their needs and you've probably moved
into something much more policy driven where there might be an initial ticket but the only thing
that happens is to assign the developer a role as I'm working as a developer or I'm working as a
developer in a certain application area then they're granted that policy driven access and they're
able to move on and get on with their lives instead of every single time they need access to a new
server every time they need a VM created every time they need additional memory provision to the VM
right all these things are crazy and in many cloud environments they have been partially solved
but a lot of us are still working on premises we're still working with servers in data centers
or in colos or working in clouds that feel like we're that in every one of those cases this is
an opportunity to make dramatic improvements in our own productivity as developers
um one example of this from my own experience at CWT was we applied this approach to a really
novel area which is um one of the teams that reported me to me was the major incident commanding
team right so every time stuff got really really bad it was like the fire department you'd call
them in they'd run the the issue and run it through to conclusion now that team had to send out a
lot of different communications to a lot of different audiences they had to send things out to
our internal executives had to send things out to all their employees who were being affected by
it we had to send some things out to our customers as well um all those communications were things
that hadn't really changed for a long time we had to get a lot better at them there were all
kinds of complaints that would come in from these different audiences because it wasn't a one-size
fits all approach it was something where but we were sending communications out that way and then
things had gradually evolved very organically there wasn't a clear way to understand who should get
what i mean so we applied these these platform as product style approaches to the communications
going out from the incident commander team and made dramatic improvements by just doing things as
simple as going out and regularly talking to the people who need to consume this stuff to understand
when do they need it what do they need what do they need to understand so they can turn around
and make the right decisions or do their jobs more effectively or tell their own customers
the people who actually pay us as a company what we need to do and what they need to do
and how long they might need to wait and when to try back and what their alternatives might be
what was interesting too is we did this in a very lightweight prototype sort of fashion right so
of course we had a technology solution for sending all this stuff out but instead of using that and
using our developer time to sit there and iterate and work their way through their backlogs we
literally just wrote a heavily formatted email by hand and started sending this out and used that
as a tool to iterate on what the product should look like and so we just put together this email
and we'd send it to somebody and say hey like what is this what do you think of this like walk me
through how you're interpreting this what you're doing and by applying that really lightweight
technique of just doing things by hand doing things the rough way before we had to put in the
effort on software development it dramatically speeded up our ability to figure out the right thing
and then spend our development effort building the right thing instead of getting getting it wrong
very slowly multiple times on the way and third self-service for developers this one is pretty
self-explanatory so I'm not going to spend a lot of time on it but really this is the continuation
of that consumerization of it trend right the expectations for user experience in the enterprise
side are very different now than they were five ten years ago and the same history for developers
right developers should not have to put up with really clunky terrible interfaces on their internal
tools anymore right it's been bad for a long long long time but things are finally starting to get
better right things have gone through very ticked-dirgin approaches my own experience at CWT was
you know we came in and we did something called value stream mapping which is a great technique
for anybody who's interested in solving a lot of problems like this where we worked through
a very specific workflow and the one we picked was deploying a new application for the first time
um worked through every single team a request went to every single team that had to touch it and
end up being something like 15 different teams were involved in this because there was a single
silo team for everything you could imagine right there was like a network team and a security team
and a firewall team that wasn't the same as the security team uh and you know the list just goes
on and on and on in large companies like this and every single one of them required a ticket in
some case it was the ticket you had to file in some case there was a ticket that a team filed to
another team and that team filed to a third team and then somebody else would audit it and somebody
else would review it and finally it would work its way through right but imagine getting all those
to a place where you can clearly define the policy once get agreement on that from all these teams
and then work on that policy and use that policy to automate all of your governance going forward
all right that's what we're talking about um we took out of a 45 day timeline to deploy new
app we took 30 days out on the way there um by making some simple process improvements and applying
some automation now let's look at some solutions over the course of the next minute what do you
need from a solution you need a job runner pretty simple because you got to do stuff you need a web
GUI so you can click some buttons you might want to click on it have an API or CLI but those aren't
necessities you need to access controls so that only the right developers can do the stuff you
want to do and of course you need to be floss now there's a few different classes of these job
runners you might look at internal development platforms you might look at CI servers you might
look at workflow and data orchestration tools or you might work on look at task schedules there are
all good options when you're thinking about how do I do this platform engineering and really the
answer here is use whatever you've got don't make this huge start where you are you can use GitOps
you can use backstage you can use even Jenkins you can use workflow and data orchestration tools or
task schedules so hopefully that's given you a sense and I'd encourage you to refer back to the
slides later to see that list because I went through it pretty quick of what platform engineering is
all about what are some of the different solutions and that you should start exactly where you are
today using the tools you have don't make this over complicated thank you