So, up next, we have Christos and Alex and unifying observability in the power of common
schema.
Okay, thanks everyone and welcome to our talk.
We will in this presentation talk about the conversion story of two schemas of open telemetry
in the elastic common schema.
But let's first introduce ourselves.
My name is Alex.
I'm leading the open telemetry initiative at Elastic and I'm a co-maintenor of the open
telemetry semantic conventions project.
Hi, I'm Christos.
I work on elastic as well and I'm software engineer focusing on observability and specifically
open telemetry where I am a contributor and a prover on the semantic convention project.
Okay, we would like to start with a quite easy and simple question.
How many of you do know exactly what open telemetry is?
That's great.
I can skip some slides later.
How many of you do know what semantic conventions is about?
That's what I expected.
And how many of you do know what elastic common schema is?
Okay, thanks everyone.
So let's deep dive a bit on the history of open source tools and standards in observability
to give us a picture where the standards come from.
Let me.
Okay.
No.
Does that work?
Okay, around, do you hear me?
That works well.
Okay.
Around or a bit more than 10 years ago when microservice emerged that also changed the
observability market and industry.
That's when like big tech companies started building their own open source tools for collecting
observability data.
So tools like Zipkin, Jega for distributor traces emerged, the Elk stack for logging,
Prometheus for metrics.
We heard a lot about this in previous talks.
And based on this defective standard tools, then actual standards emerged like open tracing,
open sensors later for distributed tracing, open sensors also covered metrics and the
open metrics as a derivative of Prometheus format emerged and Elastic has its own ECS
that defines the semantics of structured logging data.
Since we will talk a bit more about ECS, a quick introduction what that is.
So ECS stands for the Elastic Com Schema and it's basically just a definition of a set
of fields that describe the semantics in structured logging data.
So for example, if you're collecting a service name with your observability data, the Com
Schema tells you that you should put this value into a field that is called service.name,
not app.name or application.name.
So you have common names that you can later on search for and this also allows you to
correlate data across different signals.
Now as you can see, we already have at least four standards here that are partially competing,
partially complementary.
Plus we have all the tools that also create some defective standards for collecting data.
So it's ridiculous to have so many standards, right?
We need one more that covers all of them.
And usually what happens is we have one more that is competing with all the others.
And yes, we have one more standard for observability.
OpenTelemedia will come back to the comic later again.
This is the slide that I can skip based on the Paul.
So OpenTelemedia provides not just a standard but a full ecosystem and framework for observability.
For collecting data, sending it protocol.
One thing that I want to highlight here, there is a specification in OpenTelemedia that defines
what data you can collect, like traces, metrics, logs.
OpenTelemedia working group is also working on a profiling signal.
And what we will talk more about in this presentation is the semantic conventions.
Semantic conventions are very similar to what I've shown for ECS.
And basically defines, yeah, attribute names and their semantics.
Let's have a concrete example of how the data structure in OpenTelemedia looks like here
with some logging data.
Very simplified view here, it's a bit more complex.
But let's say we have a set of log records, right?
The OpenTelemedia protocol defines like the core structure of that signal with fields
like severity text, which is basically the log level and body, which is basically the
log message.
In addition, you can collect with your observability data additional context information.
This is usually represented in so-called attributes, and that's where semantic conventions come
into play.
The semantic conventions define which attributes exist, their names, types, and also the semantics
behind this.
For example, if you're collecting an HTTP access log, right, and you want to capture
the HTTP request method, this is the attribute name that you would use for it.
Now observability data is usually also captured in a broader context for some resource like
a concrete service, a host, or other resources.
That's why OTLP wraps the actual observability data into a resource wrapper, and a resource
again has a set of attributes, so-called resource attributes, that describe the resource,
something like the service name, host name, and so on.
So this is the structure in OpenTelemedia for collecting observability data, and semantic
conventions is just about the attributes basically in their meaning in this data.
Now let's come back to our timeline of standards.
There's one important thing I didn't mention before.
Actually OpenTelemedia, and we heard this in the previous talk, is the result of a merger
between open tracing and open sensors.
OpenTelemedia also supports Prometheus metrics and OpenMetrics that we have heard in some
of the previous talks, and just last year, Elastic also announced the donation of ECS
into OpenTelemedia.
So coming back to this, the question is, is it really that we have one more competing
standard?
I would say actually not.
With OTLP we have less competing standards, and OTLP really succeeds in reducing the amount
of competing standards and becoming the one and single standard for observability.
Now as I said before, Elastic announced the donation of ECS into the OTLP's semantic
conventions project.
Why?
Yeah, because there are great benefits to this.
First of all, there are complementary parts and strengths in both schemas that we now
merge into one single schema.
And second, we grow two different communities by merging them and providing a bigger network
effect.
So it's a huge win I think for the community, but there are not only benefits, there are
also challenges, right?
First of all, the overlap between the two schemas is a potential for schema conflicts.
And to resolve these conflicts might mean that we need to have either breaking changes
in the one schema or in the other.
We have seen the structure of observability data in OpenTelemedia, which consists of the
protocol with the nested structure plus the semantic conventions.
It's quite different to how ECS defines the fields because ECS is just a plain definition
of fields without like nested structures or so.
So there's some difference resolving that is a bit of a challenge.
Another interesting thing that we discovered when we started merging ECS is that in OpenTelemedia
before the merger, many times attributes have been defined in a concrete context.
For example, we have here an HTTP server span and the attribute HTTP route is basically
defined under the semantic conventions for HTTP server spans.
The problem is now if I want to use the same attribute in a different context like let's
say HTTP access logs, I mean there was always a means just to reference the other attribute,
but it feels sort of weird because in the one context is a first class, right, attribute
and the other one is just a reference that overrides some semantics.
So learning from ECS, what we already achieved with the merger is that now we have in OpenTelemedia
a dedicated attributes registry that serves the case of just defining attributes with
their types, with their meaning and in the different semantic conventions and their use
cases we are just referencing those attributes.
So we have clear separation between defining attributes and using them in a concrete context.
And finally another challenge is metrics.
Metrics formats in OpenTelemedia follow the TSTB model.
So we have a concrete metric name like system disk IO in this case with a type, with a unit
and we have a set of dimensions modeled as attributes.
In this case direction for example for disk IO read or write.
In ECS previously the metrics were basically modeled as numerical fields on documents and
you can have multiple numerical fields in the documents so you can have multiple metrics.
That's the reason why often some of these dimensions that we have in OpenTelemedia are
just encoded into the metric name on ECS side.
So we have things like disk read bytes or disk write bytes.
This is quite a big difference in modeling.
This is a case where we are learning basically from OpenTelemedia and adopting this at Elastic
now also with Elastic Search supporting TSTB.
So we see we are learning from both sides which is a great thing and we are coming to
the best solution possible for the community.
And Chris will tell you how this actual merger is happening in practice.
Thank you.
Can you hear me?
Okay.
So as Alex mentioned there are a lot of things going on so the question is when is time to
celebrate the merger that everything has been completed and the truth is that we are
not there yet.
There are things that needs to be done and actually everyone believed in the beginning
that once the merger was announced that that's all.
I mean we have not anything to add there but yeah the truth is that the actual work started
right after the merger was announced.
So yeah let's see some examples of how the merger is happening and how things are moving
forward.
So I have some real examples here from the upstream repository on GitHub with issues and
pull requests.
So this one for example is trying to add some new resource attributes for the container
images and specifically the digest of the image.
So as we can see that PR was filed on the 4th of July I think yes and it took it some
time to get seen right.
So it took us like many review cycles more than 20 blocker comments actually there so
lots of back and forth lots of discussions but that one was actually merged after almost
two months.
And another example is about a very important attribute the IP of the host hosted IP as
we call it and this one was really unique really interesting actually because this PR
was filed by a non ECS contributor.
So actually that contributor used to work for a company that it's I would say completely
unrelated to the ECS project but it was quite nice because in that case the existence of
the ECS project was taken into account and there were very interesting conversations
and it took us like almost three months to have it in.
So yeah it's quite obvious with these examples that the merger was not something trivial
not something straightforward that can happen from one day to the other by for example writing
a script that will transfer everything from one project to the other or something like
that.
So we have decided to take an approach to move let's say not so fast and pay attention to
the detail and have the proper people work on specific areas so as to leverage their
expertise and be sure that what we are merging to the up seem to the final project which
is actually the sematic convention of open telemetry will stay there and everyone will
be happy with that in the future.
So that's more or less the areas of the sematic conventions.
We have areas in area about databases cloud containers Kubernetes HTTP system metric
system resource attributes and many others.
And yeah so we have started focusing on specific areas some examples is the effort that we
are doing on the system metrics area we have a working group working there focusing on the
stability of the area.
We are in a really good position now we are moving towards the ability really soon and
the same for the process namespace the process area the process resource attributes and the
same for container area we are close to achieving the 100 percent converges there the recent
going PR that will add the final attributes final metrics excuse me same for HTTP and
network areas we have good coverage HTTP sematic conventions were declared as stable really
recently so we are adding on top now which is quite nice and yeah we have work in progress
in databases mobile areas cloud Kubernetes so we have working groups getting started
and focusing on these areas and yeah over the past months we are focusing on making
the project as good as possible it's a community driven way so we as ECS contributes to the
contributors donating this project we are not only focusing on the merger itself but
we want also to ensure that the sematic conventions project will be there and will can serve us
in the future so we are also focusing on other things as well like improving the tooling
of the project working on the guidelines this is quite important because there are many times
that the guidelines of the one project are in conflict with the guidelines of the other
projects so in that case we need to take a step back and reconsider the guidelines and
see what we want to have there as a final result and yeah also we work on restructuring the
project before it was the sematic conventions within the project were grouped by signal
logs metrics traces and so on but now we have a better organized organization there and we
group the attributes by topic and yeah as Alex mentioned already we have introduced the
global attributes registry it's actually a very big list with all the attributes there and then
within the actual specification you can reference the attributes from there so yeah that's quite
useful and we're also working on adding a new concept from ECS which actually the attribute
nesting or reusing some namespaces that means that if you have a namespace for example always
dot whatever you can nest it attach it as it is under the host namespace for example and you
don't need to redefine it again so yeah these are some examples from the upstream most of them are
closed some of them are really let's say close to be completed but we have some small blockers
there but the work is moving forward that's a that's the point and yeah how the community is
organized around these so as I mentioned before we want to have proper people working on specific
areas leveraging their expertise so we have working groups working on each area and we're
trying to first declare their the areas of the semantic attribute the sematic conventions as
stable which means that all the semantic conventions that we will have there will be
stable and then we can use them in the actual implementations so the next step is to tune
the implementations accordingly which means essentially the open telemetry collector and
the language SDKs and yeah some examples the system metrics working group the working group
around databases we have a security semantic conventions working group which is getting
started now we have also approvers areas for the mobile area containers Kubernetes and many others
that I don't mention here and the process looks like this first once you want to create a working
group or a specific project you propose the working group area and you mentioned there what
issues you want to work on and then you will have people expressing their interest to join this
effort you will need to find a sponsor from the technical committee and yeah once everything is
decided we have a specific project board we have regular meetings we have people getting assigned
to the issues there and yeah the work is happening like this and yeah regarding the merger itself
in yeah technically it happens like this we follow this process so once we have to either
introduce some new fields some new semantic conventions or we want to move something from
ECS to the semantic conventions of open telemetry we first check obviously what we have in these
two projects and we also check what implementations have so far essentially the open telemetry
collector or the SDKs because there are cases that the for example the collector already
uses some some let's say metrics there or some semantic events some resource attributes for
example but those are not yet part of the semantic conventions of open telemetry so in that case we
also check what there is there so we might find something interesting so we can use it and once
we have everything considered we have a final proposal we raise an issue or a pull request
directly and we start the discussion within the community we yeah particularly focusing on
measuring the breaking changes because you can imagine that we want to avoid bringing frustration
to our users on both sides so yeah that's really unique really important thing to consider and
we go through the review process and then once we have a conclusion we merge and then of course
we need to handle the breaking changes because they are there most of the times and yeah the summary
for today is that the merger is happening feel free to join us contributors are more than welcome
everything happens in the app stream so if you are interested please join and you will see that
you will find that you will have real impact from day one there and the goal of everyone is to make
the semantic convention of open telemetry the one unique straight one unique and straightforward
standard for observability and security that will be there for the future so yeah with that you
can find us on csf slack channels or by using our github handles and some project meetings on
Mondays we have the semantic events working group meeting same our next day Tuesdays we
have the specification sig meeting and on Thursdays we have the system metrics working group 530
30 central time and yeah without any questions I think we're out of time
do we have any questions
hi thank you for the talk this this was really interesting and clarified some things for me
I have one question about what's how what are the benefits of these semantic conventions in terms
of like front-end tooling that that we are using because I know that you know there's this idea in
open telemetry project that you have semantic conventions and you have common attributes
for different signals and then we collect all this data in all these different signals in some
observability tools and I imagine in like front-end we could automatically correlate different signals
if we have this like common attributes I'm not up to date with the current state of this this area
so yeah this is my question what are the main benefits of following this semantic conventions
yeah I would say there are two actually one is I mean open telemetry is an open source standard
right and there are many vendors adopting this so we need common semantics of what the data
represents to build features higher level features on top this is the first thing and the other one
is correlation as you already mentioned cross like different signals to also have correlation
cross or through the resource attributes for example so you can drill down basically on
different signals into the same resource and yeah I would say these two things and also cross
signal correlation not only through resources but things like trace ID to have them you know both
on locks and traces and later maybe in profiling data this kind of things okay thank you so are
you doing something like that in elastic like in front-end at the moment is there any work going
on in this area like correlation of different signals yeah of course like I think that's that's
the goal for for every observability vendor to bring all all these different signals together
yeah okay great thank you very much any other questions going once
okay cool then bingo plus
okay