Okay, next Dimitris and Gio will show us what CI/CD observability looks like and how observability works in CI/CD pipelines.

Yes, that's right. Hello everybody, can you hear me? Awesome. Thanks for having us here. This is actually the second time ever we're giving this talk, because the first one was three hours ago. So for those of you who were there, we're sorry, it's going to be a bit repetitive. But let's get started.

So, what is CI/CD observability, and how do we bring it to CI pipelines? This is our abstract. We're not going to go through it, because it's huge. In a nutshell, this talk is about enhancing the reliability and performance of our pipelines by bringing observability to every stage of software delivery. We're going to answer two questions: how we can identify flakiness and bottlenecks, and how we can envision a future of effortless visibility. We're also going to talk about the pivotal role OpenTelemetry plays in what we're trying to solve, how we want to shape CI/CD's future, and the challenges and opportunities ahead.

A little bit about us. I'm Dimitris, a software engineer on the Platform Productivity squad at Grafana Labs. And I'm Gio, a software engineer on the Explore squad, also at Grafana Labs.

And with that, let's go through the agenda. We're going to start by talking about what CI is and what CI/CD systems are, then give an introduction to OpenTelemetry and semantic conventions, talk about why it's important to own your data, and then cover where we are now, practical use cases, and what's next.

So we'll start with a question: what is CI, basically, for everyone? And we're talking about the definition here. The definition of continuous integration varies depending on the author. For example, we have a few experts writing about it in two different books, and they describe something similar but use different words. The only thing we can be sure about is that the word "continuous" is always going to be there, because we're talking about a never-ending feedback loop, something that runs continuously in order to identify flakiness and help us improve our processes.

All that being said, we get to the next question: yes, but what is CI for real this time? As we said, CI can mean different things; it's basically a list of things, and it means something different to each person. It may be a mechanism to reduce repetitive manual processes, or a way to enable better project visibility, or a way to find and resolve flaky tests and builds and prevent paging people at 3 AM, because human hours are important.

The next slide shows what a typical CI/CD pipeline looks like in a modern company, and I think it's the same for most of us here: we go from testing to building to deploying. By the way, we know we haven't talked a lot about CD, but we believe CI/CD observability should come as a whole: there's no CD observability without CI observability, and vice versa. So we're talking about deploying to prod: waiting for the 3 AM maintenance window to deploy, then getting errors or downtime and starting to panic. And once we've resolved all of those things, we go and add them to our LinkedIn profiles, like "automation master engineer" and all that.
So we come to the next question, which is the most important one: what is CI, but for real, real this time? And what we're looking for is just one word. Anyone want to guess? Someone who wasn't in the previous room, by the way. No one? You dare to guess? "I would say alerting." That's right: CI is alerting.

CI and alerting actually serve a common purpose, and we should see them as one component. If alerting is integrated into CI, we can catch issues early in the development process, and effective alerting within CI ensures that threshold breaches and potential problems are identified before deployment. If we see CI as the left shift of alerting, we get early detection during development, a shift towards proactive monitoring, preemptive issue resolution, and so on. As you can see in the picture, they need to hold hands for as long as they live, I think.

On the next slide we present what CI does. We define CI as the guard in the early stages of development: it detects changes, maintains build health, and constantly monitors system signals. CI catches issues before they appear in production. Alerting is on the other side. This is not a slide comparing continuous integration and alerting; we just want to show how tightly coupled they are. Alerting is our safety net in the later stages: if something slips through CI, alerting should be there to catch it. But the most important thing to remember is that CI and alerting are not two separate things working in parallel. CI and alerting are one component, and we should regard them as a whole. For alerting, we need to make sure we create actionable items: if something slips through CI and alerting catches it, we should have enough runbooks and documentation so that everyone can find the solution to the problem quickly.

So, where are we now, and what is this whole talk about? Observability so far, as we have it: manually searching through logs and traces trying to find the root cause, jumping from GitHub or GitLab or whatever you're using, to your CI vendor, which may be something different again, to Grafana or whatever other observability tool you may have. Trying to correlate the same error as you move from one tool to another is hard, so we want to create a centralized way of dealing with these problems. If observability comes that late, at the run stage of our development and deployment lifecycle, we have limited visibility during the early stages, difficulty in root cause analysis, increased mean time to recovery (Gio is going to talk more about that), and we miss optimization opportunities: it's really hard to say what to fix to improve your build times, for example, if you don't have observability in your CI pipelines.

So this is how it looks now when something catches fire. There's this meme, we all know it: we would expect someone to go in and put out all the fires, but instead, if you don't have CI/CD observability and alerting, this happens. It's too late to resolve anything; you may lose money and customers. You know how it goes.
So if we want to get data out of CI/CD, we have to shift our focus a little to the left, towards building, testing, and deployment. This way we can address issues before they escalate in the production environment, enhance efficiency by catching problems early in the process, improve system reliability and robustness, and it's also a huge asset when it comes to cost reduction.

What this amounts to is this slide, which is my favorite. Instead of having the fire, with the engineer sitting in the middle waiting for someone to come and put it out, we can be proactive: mitigate the fire before it appears and do something about the problem. We tried to make it easy for people, for new joiners, for people who are not familiar with the CI vendor we use. We tried to make it easy and vendor-neutral for everyone to consume the CI/CD observability output, and for that reason we used OpenTelemetry. We tried to define standards, certain patterns that can be used widely by everyone, no matter what tools they actually use. And with that, I'll hand over to Gio to talk more about OpenTelemetry.

Thank you. I don't know how this works, I'm going to just... whatever. Hello everyone. Give me a second. There it is. First of all, one question: who here is familiar with OpenTelemetry? Of course. I'll give a quick definition for those who are not. OpenTelemetry is a framework designed to create and manage telemetry data such as logs, traces, and metrics. The full definition is more comprehensive and accurate; it's on their website, and I invite you to take a look. For the context of this presentation, I'd like to focus on two parts of that definition: semantic conventions, which we can also call standards, and owning the data you generate.

First of all, standards. When we talk about standards, or semantic conventions, we're really talking about knowing what data we are storing and what to query for. Semantic conventions in this context can be divided into two categories: by signal type, such as logs, metrics, and traces, or by the area they belong to, such as general semantic conventions, databases, exceptions, and so on. There is one thing we think is missing here, which is continuous integration and continuous delivery, and we think it's a perfect fit. Why? Let's take a look at three different CI/CD systems: GitHub Actions, Drone, and GitLab CI. They tend to call things differently. I've highlighted three specific bits here, duration, job name, and outcome, for three different pipelines. The idea is that the data behind those three screenshots is actually the same. Duration is duration. The job name: Drone calls it a stage or a job, I don't know; there's a different name, every CI tool calls it something different. Same thing for the outcome: status can be called outcome or something else, but it's the same data. By defining a standard, we can define a single, common way of referring to this data.
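To make that idea concrete, here is a minimal sketch of what such a shared vocabulary could look like in Go. The attribute keys and the helper below (`cicd.pipeline.name`, `CommonAttributes`, and so on) are illustrative placeholders, not the finalized OpenTelemetry semantic conventions for CI/CD.

```go
package cicdconv

import "go.opentelemetry.io/otel/attribute"

// Illustrative, vendor-neutral attribute keys. These are placeholders for the
// kind of names a CI/CD semantic convention could standardize.
const (
	AttrPipelineName = "cicd.pipeline.name"     // e.g. GitHub "workflow", Drone/GitLab "pipeline"
	AttrJobName      = "cicd.pipeline.job.name" // e.g. GitHub/GitLab "job", Drone "stage" or "step"
	AttrOutcome      = "cicd.pipeline.outcome"  // e.g. GitHub "conclusion", Drone/GitLab "status"
)

// CommonAttributes maps vendor-specific fields onto one shared vocabulary, so
// dashboards and queries don't need to know which CI system produced the data.
func CommonAttributes(pipeline, job, outcome string) []attribute.KeyValue {
	return []attribute.KeyValue{
		attribute.String(AttrPipelineName, pipeline),
		attribute.String(AttrJobName, job),
		attribute.String(AttrOutcome, outcome),
	}
}
```

The exact names matter less than the fact that every CI system's data gets translated into the same keys before it is stored and queried.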
So, what happened was that at some point I wanted to investigate why a test in one of our pipelines sometimes took three minutes and sometimes took nine. Which didn't make sense: there were no code changes; rerunning the same pipeline, the same job, would result in different times. So I started writing some Go code. I'm a front-end engineer, by the way, so... but it was pretty good. No? Okay. It did the job. I was able to get the data we needed from our Drone CI system into Loki, Tempo, and Mimir. It worked; we were able to ask the questions we were curious about, why that was happening, and we found something.

But before finding those answers, I shared the news with my team, and I guess they got a bit excited about it. What I mean is that one colleague asked me to write the same thing but put the logs into Elasticsearch, and another colleague wanted the same thing for GitHub Actions instead of Drone. And, well, I'm not that good at Go; it didn't feel like a very scalable way of doing this for me. The question here is: how can we stop ourselves from writing the same code that does the same thing, but for different systems and for different databases? It may sound like a silly question; none of us, I think, uses sixteen different databases to put our metrics, logs, or traces in. The point is that if we ever wanted to do this as a community, not as Grafana, not as me, we would have to write, I don't know, I didn't count them, eighty different things, just for those databases and just for those three CI systems. No, there was no way I was going to do that.

That brings me to the part of the definition we haven't talked about yet, which is important: owning your data. When I started thinking about owning my data, what I always pictured was having ownership of the hard drive it was going to be stored on, or ownership of the machine my database was running on. In reality, I think the definition goes a bit beyond that. It's about deciding what to do with the data you generate: where to store it, where to send it. We can very well be using a cloud provider; the idea is that I decide, and I know, what data I'm storing and where. OpenTelemetry helps with that. OpenTelemetry defines standards, defines a way of doing this. It's a common language, if you want, for sending data everywhere and getting data from basically everywhere.

What we built, in this case, was an OpenTelemetry Collector distribution, a very small distribution meant to solve our specific use case. Specifically, we wrote a Drone receiver, which listened to a webhook from Drone, received data about completed pipelines, generated traces, logs, and metrics, and exported them to Loki, Tempo, Mimir, and Prometheus. The idea here is that we didn't have to write any of the other components; those are all open source components that are already available, and they work with everything. We didn't have to write any of that code, we just had to configure it. There are some other practical examples out there too: a Jenkins plugin, a GitHub Action that generates trace data, and our own one, if you want the link.
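As a rough illustration of what such a receiver does, here is a minimal, self-contained sketch in Go (not the actual Drone receiver from the talk): a long-running process that listens for a "pipeline completed" webhook and turns each payload into a trace span. The payload shape and the /webhook endpoint are made up for illustration, and spans are printed to stdout instead of being shipped to Tempo.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// pipelineEvent is a made-up shape for a "pipeline completed" webhook payload;
// a real Drone (or GitHub/GitLab) payload looks different.
type pipelineEvent struct {
	Pipeline string    `json:"pipeline"`
	Branch   string    `json:"branch"`
	Status   string    `json:"status"`
	Started  time.Time `json:"started"`
	Finished time.Time `json:"finished"`
}

func main() {
	// Print spans to stdout; in the setup described in the talk this would be
	// an exporter pointing at a tracing backend such as Tempo instead.
	exp, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer func() { _ = tp.Shutdown(context.Background()) }()
	otel.SetTracerProvider(tp)
	tracer := otel.Tracer("ci-webhook-receiver")

	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var ev pipelineEvent
		if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// One span per completed pipeline run, carrying vendor-neutral attributes.
		_, span := tracer.Start(r.Context(), ev.Pipeline, trace.WithTimestamp(ev.Started))
		span.SetAttributes(
			attribute.String("cicd.pipeline.branch", ev.Branch),
			attribute.String("cicd.pipeline.outcome", ev.Status),
		)
		span.End(trace.WithTimestamp(ev.Finished))
		w.WriteHeader(http.StatusAccepted)
	})

	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Println(err)
	}
}
```

Because a receiver like this only reacts to a webhook after the pipeline has finished, it adds no overhead to the pipeline itself, which is also how the performance question in the Q&A below is answered.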
Now, in practice (I'll try to be fast because I think we're running out of time), what does this mean? What we were able to do, by implementing CI/CD observability in our own pipelines and in our own software delivery lifecycle in general, was to identify a flaky test we had in Grafana that sometimes popped up; we didn't know why or what was happening. By pushing the logs of our CI system into Loki, we were able to identify where and when that log line, that test output, appeared for the first time. From there we found the build number on Drone, and from that we jumped to the change on GitHub that first caused the test to fail, which meant we were able to solve the actual issue. It was some code pushed by someone else, for a totally different test, that shouldn't have caused any issue, but it made this test flaky.

Another thing we did was build our own custom experiences on top of Grafana. The first one is just a, let's say, fancier way to look at CI logs. The second one, I think, is a bit more important, and it's something Dimitris was using in his job and is still using: monitoring, at any given time, which of the branches has a failing build. Why? Because in Grafana we have to support quite a lot of different release branches, for when we have security fixes to make or when we want to release older versions of Grafana. We need to be sure those builds are passing, because a failing build means we have to spend time first fixing the issue and then doing the release. By having an alert on this status, we were able to know beforehand when a release was going to fail because the build itself was failing, and prepare ourselves to fix it ahead of time (a small sketch of such a status metric follows below).

So, what else does it unlock? Really, a lot of things. The idea here is that you own your data, and owning your data means you can ask the questions you want, the questions that feel right for your use case, which can span from security, to flakiness detection, to performance improvements, to DORA metrics. Honestly, we don't know; we feel like there are a lot of opportunities here that can be investigated, and we feel like this adds a lot of value. All in all, we liked having everything within Grafana, production metrics and CI metrics together.

Now, what's next? This is, as I said, at a very early stage. Our proposal got accepted by the OpenTelemetry Technical Committee, and we formed a working group. We want to define the standards to work on and move forward. So please join the conversation; we would really like to hear about different use cases. Reach out on the CNCF Slack channel, and the second link is the OpenTelemetry proposal for CI/CD observability. Thank you all. Thanks to you. Thanks, Dimitris.
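To make the branch-status alerting described above more concrete, here is a minimal sketch assuming a Prometheus-style setup: a tiny exporter that publishes a per-branch build-status gauge an alert rule can watch. The metric name, label, and branch values are made up for illustration.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// buildStatus is 1 when the latest build on a branch passed and 0 when it
// failed. The metric and label names are illustrative, not from the talk.
var buildStatus = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "ci_branch_build_passing",
		Help: "1 if the most recent build on the branch passed, 0 otherwise.",
	},
	[]string{"branch"},
)

func main() {
	prometheus.MustRegister(buildStatus)

	// In a real setup these values would be driven by completed-pipeline events
	// arriving at the collector; here they are hard-coded to keep the sketch
	// self-contained.
	buildStatus.WithLabelValues("main").Set(1)
	buildStatus.WithLabelValues("release-10.2").Set(0)

	// An alert rule such as `ci_branch_build_passing == 0`, evaluated by
	// Prometheus or Grafana Alerting, would then page before release day.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```

In the setup described in the talk, the gauge would be updated by the pipeline data flowing through the collector rather than set by hand.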
Do we have any questions so far? Can you please wait? We still have five minutes for Q&A. Okay.

Are there any plans to release this feature in Grafana for the public? How long do we have to wait to be able to use this kind of stuff?

Well, the thing you can do right now is just use OpenTelemetry: maybe take what we did with the Collector, create your own collector, use our collector, or use the things that already exist for GitHub or Jenkins, and then make sure you export all the information that is valuable to you. And you can still use Grafana as your visualization tool, of course. But like I said, the plugin isn't ready yet. All of these things came out of a hackathon project, essentially, so we're still unsure. But you can do it yourself. If you set everything up, if you set up the OpenTelemetry exporter and so on, it should be a piece of cake.

Any further questions? Does the metrics collection itself have an impact on the CI? I didn't get the question, sorry. Does the metrics collection itself have an impact on the execution time of the CI? Ah, whether it has a performance impact, okay. Ours doesn't, because of the way it works: it's a long-running process that waits for the CI to complete and listens to a webhook. The whole thing happens afterwards, and I think it even runs on a completely different server or machine than the CI system, so it doesn't impact anything. In general it depends on your CI system and how you collect those metrics, logs, and traces. So in our case, no, but it really depends on how you implement the collector.

One more? I don't see any hands. And that's it. Thank you.