Thank you. Hello everyone, thank you so much for being here. In today's session we are going to do an introduction to AI-driven observability and operations in cloud-edge systems. First, let me introduce myself. My name is Victor Palma, I'm a cloud engineer at OpenNebula Systems. I come from Madrid, Spain, and I've been working at OpenNebula for more than two years.

So let's move on to the presentation. First I would like to start with some initial context, to introduce what we are going to see here. So, what is observability? Observability is the ability to understand and analyze the internal behavior of a system by collecting and analyzing relevant data. That is the dictionary definition. But in other words, it's just the ability to transform data into information, into something that can be useful for us. We can have a lot of system logs, data, and numbers, but if we don't give them a meaning, they are useless.

Observability has several advantages. It enables anomaly detection, which allows us to identify anomalies or bugs in our system. It also gives us the ability to do performance analysis, so we can identify areas for improvement in our system. And finally, it's very useful for decision making, because we can see the impact of the changes we make to our system in a very easy way. As the saying goes, information is power, so observability is very, very important nowadays.

Now I would like to talk about AI, because AI nowadays is everywhere; you know, the marketing guys' fault. But is it really useful for observability? The quick answer is yes, certainly. AI provides the capacity for enhanced data analysis, automated anomaly detection, and dynamic scaling for our cloud. For example, if we have more workloads than usual, we can automatically create new nodes or deploy new VMs in our cloud in order to provide more services to our customers. And finally, we can create predictive analytics in order to predict how our system is going to behave in the future.

After this part, I would also like to talk about data sovereignty and open source, because I think the most important concept around AI at the moment is the sovereignty of the data, of the information. Many organizations entrust sensitive data to third-party providers. Currently most of these providers are based outside of Europe, and we need a way to bring the data back to our servers in Europe and be more transparent. Open source is a very good solution to that problem: it provides more transparency for the cloud and helps reduce vendor lock-in in our infrastructure, so we are not tied to a specific vendor and can migrate between vendors whenever we need to.

So what's next? How can we address all these challenges? The answer is the OneAIOps framework, the open source solution for AI-driven observability. The OneAIOps framework combines OpenNebula as the virtualization and cloud management tool, Prometheus and Grafana as the metrics and visualization solution, and some AI and ML algorithms to predict and analyze the behavior of the infrastructure. These three technologies together create the OneAIOps framework that we are going to see here today.

So let's go step by step, and first we are going to see what OpenNebula is. OpenNebula is an open source cloud platform for building your own cloud.
It provides the ability to deploy virtual machines in your own private data center, in the public cloud, or even at the edge. And not only virtual machines: you can also deploy application containers, micro-VMs, or even Kubernetes clusters. As I said before, since OpenNebula is open source and oriented to providing true flexibility for the cloud, one of its features is that it avoids vendor lock-in, so you can migrate your workloads between different providers in a very, very easy way.

OpenNebula has a lot of integrations with third-party tools like Terraform, Kubernetes, Ansible, or Docker. It also has built-in tools like Sunstone, the web user interface, from which you can manage all of your infrastructure, or you can do it from the CLI, and you can deploy virtual machines based on VMware, KVM, LXC, or micro-VMs on Firecracker.

Finally, one of the most important features of OpenNebula is the possibility to expand your cloud to the multi-cloud or the hybrid cloud. You can create on-demand resources at the edge, in Amazon, Google Cloud, or Equinix, just by clicking a button, or automatically if you configure it that way. So you can migrate workloads from your on-premises data center to an edge data center or to the public cloud in a very straightforward way. You can deploy any infrastructure, with uniform management for all of it, and run any application in your cloud. OpenNebula doesn't care whether the host is located in Equinix, at the edge, or in your private data center; the only things OpenNebula cares about are which VM is running the workload and how you can access that VM. Very handy.

The next piece of OneAIOps is the integration of OpenNebula with Prometheus. That integration is based on Prometheus exporters, like the Prometheus Node Exporter, which is installed on every OpenNebula server and also on the hypervisor nodes. It is combined with the OpenNebula libvirt exporter, a custom exporter created by OpenNebula to extract and collect information about the KVM machines. We also combine this information with OpenNebula's own metrics, which are gathered by oned, the main daemon of OpenNebula, and exported to the Prometheus server through the OpenNebula server exporter.

The next thing is the AI that we add to the formula. We created a set of machine learning algorithms and decision algorithms and implemented them as an exporter for Prometheus. Gathering all the metrics that OpenNebula and the exporters produce, these algorithms predict how to improve the performance of your cloud.

So, in summary, the features and capabilities of OneAIOps are: CPU usage prediction for the VMs of your cloud, with OneAIOps predicting the individual VM CPU usage per hour and the overall CPU usage for your hosts; the accuracy of that prediction, a very important value in terms of how much you can trust the system; and suggestions for where to place a VM. Based on that prediction, OneAIOps may tell you to migrate a VM from one server to another in order to improve performance.
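As a rough illustration of how a prediction pipeline can consume these exporter metrics, here is a minimal sketch that pulls per-VM CPU averages from Prometheus over its standard HTTP API. The endpoint, metric name, and label (`opennebula_vm_cpu_ratio`, `one_vm_id`) are assumptions for illustration only; check the exporters' /metrics output for the real names.

```python
# Minimal sketch: pull per-VM CPU metrics from Prometheus over its HTTP API.
# Assumptions (hypothetical): Prometheus runs on localhost:9090 and the
# OpenNebula exporters expose a gauge "opennebula_vm_cpu_ratio" labelled
# with the VM ID. Verify the real metric/label names on your deployment.
import requests

PROMETHEUS = "http://localhost:9090"

def query(promql: str) -> list[dict]:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Average CPU per VM over the last hour, as a predictor would consume it.
for sample in query("avg_over_time(opennebula_vm_cpu_ratio[1h])"):
    vm_id = sample["metric"].get("one_vm_id", "?")
    print(f"VM {vm_id}: {float(sample['value'][1]):.2f}")
```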
There are three main policies you can configure in OneAIOps. The first one is load balancing: as the name says, it balances the load across all your nodes, which is very useful when you have an on-premises or private data center and you want to use all your hosts. The next policy is resource contention, very useful for public cloud environments where you want to use as few hosts as possible. And the last one is reduce migration, a policy that is very useful when you want to avoid migrating VMs between hosts. That scenario is typical of edge environments, where migrating virtual machines between edge nodes is sometimes quite costly.

Here you can see the architecture of OneAIOps. It's based on the already existing OpenNebula architecture: everything at the bottom is what OpenNebula already is, and the layer at the top of the picture is the new OneAIOps layer. Here you can see the modules we have implemented to provide these predictions, and below them all the virtualized infrastructure being orchestrated.

So let's do a demo to show you how this works. First we are going to go to the OpenNebula Sunstone portal. Wait, sorry. Thank you. I'm so sorry, give me a minute. Okay. Well, this is the main dashboard of OpenNebula's graphical user interface. Here you can see the principal information about your cloud: how many machines we have, the images, the virtual networks, the usage of the hosts. This is a demo environment, so we only have two hosts with some workload, and these are the VMs we have running in it. This is just to show that we already have some workload in this environment, and this workload is fully random, so it may consume a certain amount of CPU depending on the time.

When we install the OneAIOps framework, and we have documentation for that, and then go to Grafana and import the OneAIOps dashboard, we see this. These are the results that OneAIOps generates. On the left we can see the average CPU predicted per host; here, the average CPU usage per VM; and here, the real usage. As you can see, one value is close to the other, and the accuracy of the prediction in this case is 92%.

Here we can see the suggestions that OneAIOps provides to the user. The first one is the consolidation policy, resource contention: it tries to reduce the number of hosts used to the minimum. You can see that of the five VMs we have in this demo, four are on one host; since this host is full and no more VMs fit in it, we have one more VM here, but it tries to concentrate the VMs on as few hosts as possible. Here you can see the migrations that OneAIOps suggests to achieve this distribution: it suggests we move the VM with ID 3 to the host with ID 1, so it's very, very easy to follow the instructions. Then we have the other policies: the load balancing optimization, where, as you can see, we have the VMs distributed across the two hosts in our environment, and the final policy, the reduce-migration optimization. In this case no migrations are suggested because no optimizations were found, but if OneAIOps produced something, it would show up here.

Returning to the slides: well, this is the demo we have just seen. And now, closing thoughts on the next steps of this project.
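To make the consolidation idea concrete, here is a toy sketch of how per-VM CPU predictions could be turned into "migrate VM 3 to host 1"-style suggestions with a greedy first-fit packing. The numbers and the heuristic are illustrative assumptions, not OneAIOps' actual algorithm.

```python
# Toy sketch: turn per-VM CPU predictions into migration suggestions that
# consolidate VMs onto as few hosts as possible. Greedy first-fit packing,
# shown for illustration only; not OneAIOps' real decision algorithm.
predicted_cpu = {3: 0.30, 4: 0.25, 5: 0.20, 6: 0.15, 7: 0.35}  # VM id -> predicted CPU (made-up data)
placement = {3: 0, 4: 0, 5: 1, 6: 1, 7: 1}                     # VM id -> current host id
HOST_CAPACITY = 1.0

def consolidation_suggestions(predicted_cpu, placement):
    """Greedily pack VMs onto the fewest hosts; yield (vm, from_host, to_host)."""
    load = {h: 0.0 for h in placement.values()}
    # Place the biggest VMs first, on the lowest-numbered host with room.
    for vm in sorted(predicted_cpu, key=predicted_cpu.get, reverse=True):
        for host in sorted(load):
            if load[host] + predicted_cpu[vm] <= HOST_CAPACITY:
                load[host] += predicted_cpu[vm]
                if host != placement[vm]:
                    yield vm, placement[vm], host
                break

for vm, src, dst in consolidation_suggestions(predicted_cpu, placement):
    print(f"Suggestion: migrate VM {vm} from host {src} to host {dst}")
```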
So, the next steps and challenges we are facing in OneAIOps. First, implement the virtualization operations so that the suggestions are applied automatically; currently the operations are only suggested, not performed on the cloud system. Then we would like to improve the distribution of OneAIOps as part of the OpenNebula software, because right now you need to install it separately. And finally we would like to expand the functionality to provide anomaly detection and allocation based on memory prediction and network traffic, because currently we only provide CPU usage prediction, and, based on the results of the tool, to create alerts and warnings.

This project is totally open source, so you can go to the repository on GitHub and collaborate and suggest new features and changes. And finally I would like to encourage you to join the OpenNebula community: you can visit the forum, participate in discussions with other OpenNebula users, and learn and help together in the cloud community we have created there.

As a closing slide, I would also like to say that this project is funded by the European Union as part of the Horizon Europe research and innovation programme. The project is called COGNIT; I recommend you take a look at this URL. I can spell it if you want: C-O-G-N-I-T, COGNIT. It's a very, very interesting project. Well, that's all. Thank you very much for your attention. So, questions?

Yeah. Yeah, we use linear models and Bayesian models. Very simple models.

Would you be able to share the slides on the forum or on the website? Yes, and you can also find them in the repositories; here, in this repository, all the data models and algorithms applied are explained. Okay, thank you. You're welcome. Any more questions?

Yeah. I think it's basically the same question as the last one. Could you quickly go back, because you did actually show the model a couple of slides before. Here. So, does it explain here where the Bayesian models come in? It's not in there, and that was the bit I was wondering about, how the model works. Okay, thank you, that's a question for our side. Okay. Thank you. Perfect. Any more questions? You're welcome.

Ah, it's here. Are you optimizing for CPU utilization only, or is it also possible to say, okay, optimize for availability or network throughput? Okay, he asked whether we optimize for CPU or for other attributes like network or memory. Currently, in the current state of the project, we only make suggestions for and predict CPU usage, but the idea is to implement prediction based on the network, on memory, and on other custom attributes that you want to add to the tool. The idea is that you can change the prediction in the configuration. But really, it's just a prototype for now.

And for storage? For storage prediction, yeah, we are also considering that.

Sometimes optimality changes depending on the cloud service provider. Do you also consider this, or is it only based on regular hardware? What is your optimization? Yeah, we are considering that too. I mean, we want to optimize based on the location of the VM: it's not the same to have a VM in your on-premises cloud as in the public cloud or at the edge. So it's based on different policies that we are currently defining. Yeah, we are considering that.
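As a minimal sketch of the kind of linear model mentioned in that answer, the snippet below fits a plain linear regression over lagged hourly CPU averages to predict the next hour. The data and lag choice are made up for illustration; the actual models and algorithms are the ones explained in the project repository.

```python
# Minimal sketch of an hourly CPU-usage predictor: a plain linear model over
# lagged hourly averages. Illustrative only; OneAIOps' real models (linear
# and Bayesian) are documented in the project repository.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical hourly CPU averages for one VM (fractions of a core).
history = np.array([0.42, 0.45, 0.43, 0.50, 0.55, 0.52, 0.58, 0.60, 0.57, 0.62])

LAGS = 3  # predict the next hour from the previous three hours
X = np.array([history[i:i + LAGS] for i in range(len(history) - LAGS)])
y = history[LAGS:]

model = LinearRegression().fit(X, y)
next_hour = model.predict(history[-LAGS:].reshape(1, -1))[0]
print(f"Predicted CPU next hour: {next_hour:.2f}")
```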
Any more questions? Okay, thank you so much.