Okay, so the next topic is building community-owned data confidence fabrics with distributed ledgers and smart contracts. Let's give them a good welcome.

Hi, everyone. Thanks for staying. Today we're going to be talking about data confidence fabrics. Here's the agenda for the day. First, a quick introduction: my colleague Sean here is an academic at University College Cork, and we've been collaborating on this project. I'm going to start by quickly going over Dell's open source support and community contributions, then Linux Foundation Edge and how it started, focusing on Project Alvarium, which is part of LF Edge. Then I'll hand over to Sean, who will go over a few use cases of Project Alvarium, and at the end we have a short demo to show you.

The team I work in does research mostly in European research projects, all of them funded by the European Commission under multiple programs, mostly the Horizon Europe program. We collaborate with a lot of consortia; across all these projects we've worked with over 100 partners across Europe. The domains we work in are mostly edge and cloud computing, storage, distributed ledgers, and quite recently knowledge graphs. A common theme across the projects is energy efficiency and sustainability, and most of them deal with orchestrating deployments across cloud and edge on resource-constrained devices. I'm currently working in the CLEVER project, where we're building knowledge graphs that represent Kubernetes clusters and trying out graph algorithms on them to see whether they can enhance the Kubernetes schedulers. One thing to note is that most of the code we produce in these projects is open sourced.

Now, Dell and open source. Dell has been working with the Linux Foundation for over the past decade. It's a founding member of Open Programmable Infrastructure and OpenHPC, a member of the CNCF and OpenSSF, and it's involved in the SONiC and Yocto projects. Currently Dell participates in 43 open source projects, 10 of which are Linux Foundation projects. In preparation for this talk, I went around asking colleagues about individual contributions across the company, and I learned that there's currently a process in place to feed open source contributions into our legal compliance systems, ServiceNow and JIRA, to encourage contributions from across the company and make sure they get logged. There are other activities too, such as open source events organized within Dell, and one colleague contributed a first-time-contributing page to help others make their first contribution. What I also learned is that the most effective channel across the company right now is word of mouth, so it's driven mostly by interest.

As for Dell's contributions to the Linux kernel: as a hardware manufacturer, Dell develops drivers to make sure Ubuntu runs properly on Dell systems, and these drivers are open sourced. The process is that the drivers are developed, contributed, and pushed back to the Linux kernel so that they work across all distributions.
After the first implementation, any patches needed to keep them working properly are also pushed back to the kernel, so drivers developed first for Ubuntu end up working on Fedora, Debian, Arch, or any other distro.

Now, the story of LF Edge. It started in 2017, when Dell contributed 125,000 lines of code to the Linux Foundation to create EdgeX Foundry, driven by its interest in edge computing. EdgeX Foundry is an open source platform for processing across the edge and the cloud: it enables edge data collection from sensors, communication between enterprise, cloud, and on-prem data centers, and processing at the edge, with runtimes, messaging substrates, and so on. LF Edge is the umbrella organization for EdgeX Foundry, which started the whole thing, for Project Alvarium, which we're going to discuss now, and for other projects.

Moving on to Alvarium, first the problem statement. The motivation for Alvarium came from the realization that in a lot of the projects we're involved in, we deal with data sources that are dispersed: the sensors come from different manufacturers, sometimes they're owned by different organizations, and the data is usually processed as it hops through the network. Before it reaches, say, AWS S3, it can be processed at a local server and so on. So we needed a way to measure how much we can trust the data points coming from these different sources. That was the origin of the idea of a data confidence fabric.

The idea is to define what trust is. As I said, modern applications are extensively distributed and data traverses the network. The first step is to create metadata that attests to the verifiability of the data at its origin. The second is to create metadata describing how the data was processed as it hops through the network: at any point the data is touched, we inject metadata describing what happened there. Those are the insertion points. Finally, we need a way to quantify that trust as a floating-point confidence score, so the end consumer of the data can use the score to decide whether to actuate or do anything else with that data. To summarize, there are two things here: a policy defining the measure of trust, and an implementation that interprets that policy and calculates the trust score.

As for the insertion points I just mentioned, in this figure we first have the root of trust, which is the sensor that signs the data when it captures it. The data goes from the sensors to a gateway, where authentication and authorization happen; at the gateway, the Alvarium SDK is used to inject an annotation describing the data capture environment. Next, the data travels on to an edge server or to distributed storage, where something like IPFS provides secure, immutable, scalable edge persistence. And the fifth insertion point is a ledger where the trust metadata is registered.

For a more concrete example of what trust means here and what these policies look like: at the gateway, if there's a TPM on the source device, you get a score of one; if there's secure boot, you get a score of one.
If the data was registered in a ledger, you get a score of one. Then at the edge server, more things get added: if there's an application running on that server that encrypts the data before it travels onward, you get an additional point, and if the signature is verified at this point, you get another one. That's the idea of calculating the trust measure as the data travels through the network. At the end you get a confidence score: if the data satisfies all the policies we set for it, you get six out of six; if there's no TPM on the device, you get five out of six, so the consumer of the data knows something is missing, something that could point to an issue with the data. Of course, these are just dummy weights, scoring one for each policy; you could configure it to weigh some checks more than others, and so on (there's a short code sketch of this weighted scoring a little further down).

The use cases of Alvarium: one is internal quality and security control. The second is regulatory compliance, so that a verifiable percentage of your data meets a certain threshold. Then there are marketplace applications: if you're selling your data, or any IoT data, these trust measures help others trust the data more. There's trusted actuation: if it's a real-time application and your data feeds directly into some actuation, you can use the score to decide whether you trust that action based on, say, a temperature reading. And the final one is being a trusted ecosystem partner, where your metrics could factor into trust ratings for using someone's product or service.

There are multiple implementations of Alvarium, and we'd be happy if anyone interested had a look at them: there's a Go implementation, a Java one, a Python one, and a Rust one on the way. These are the links to the website and the GitHub repos with examples. Now I'm going to hand over to Sean to walk through some use cases.

Thanks. I'm Sean O'Murphy from University College Cork, and we're partners with Dell and a whole bunch of other research institutions, universities, and industry in this European project, the Collaborative edge-cloud continuum and embedded AI for a visionary industry of the future, which we call CLEVER because the full name is a bit of a mouthful. The idea with CLEVER is that we're exploring technologies where work is done down at the edge and up at the cloud, data is passed up and down, and decisions are made about what to do with it; generally the applications are AI and machine learning applications.

From our point of view, there are several use cases for the kind of data confidence a fabric like Alvarium can give you. It allows you to mix and match data from old hardware, new hardware, and different firmware versions, some of which may be more or less secure, and there may be historical information about vulnerabilities that makes you treat some older data differently than you would if it were up to date.
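As a side note, here is a minimal sketch of the annotation-and-scoring idea described above, assuming a simple policy of boolean checks with configurable weights. The class names, check names, and weights are illustrative assumptions and are not taken from the actual Alvarium SDKs.

```python
# Illustrative sketch only: names, checks, and weights are assumptions,
# not the actual Alvarium SDK API.
import hashlib
import json
import time
from dataclasses import dataclass


@dataclass
class Annotation:
    """Metadata attached to a piece of data at an insertion point."""
    data_hash: str   # hash of the annotated data
    kind: str        # e.g. "tpm", "secure-boot", "signature-verified"
    satisfied: bool  # whether the check passed at this hop
    host: str        # where it was produced (gateway, edge server, ...)
    timestamp: float


def annotate(payload: bytes, host: str, checks: dict[str, bool]) -> list[Annotation]:
    """Produce one annotation per trust check as the data passes a hop."""
    digest = hashlib.sha256(payload).hexdigest()
    return [Annotation(digest, kind, ok, host, time.time()) for kind, ok in checks.items()]


# Each check carries a configurable weight; equal weights reproduce the
# dummy "six out of six" scoring from the talk.
POLICY = {
    "tpm": 1.0,
    "secure-boot": 1.0,
    "ledger-registered": 1.0,
    "encrypted-at-edge": 1.0,
    "signature-verified": 1.0,
    "data-capture-environment": 1.0,
}


def confidence_score(annotations: list[Annotation], policy: dict[str, float] = POLICY) -> float:
    """Normalized score in [0, 1]: earned weight divided by total policy weight."""
    satisfied = {a.kind for a in annotations if a.satisfied}
    total = sum(policy.values())
    return sum(w for name, w in policy.items() if name in satisfied) / total


reading = json.dumps({"sensor": "temp-01", "value": 21.4}).encode()
anns = annotate(reading, "gateway-1", {
    "tpm": False,  # no TPM on the source device
    "secure-boot": True,
    "ledger-registered": True,
    "encrypted-at-edge": True,
    "signature-verified": True,
    "data-capture-environment": True,
})
print(confidence_score(anns))  # 5/6 ≈ 0.83
```

With equal weights this mirrors the "five out of six when the TPM is missing" example; weighing, say, signature verification more heavily than secure boot is just a change to the policy dictionary.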
You may also have an application where you're accepting contributions from a whole mix of different sources. Rather than a single organization having complete ownership and control over a whole network, you could be accepting public sources, citizen science and that kind of thing, where expert users and lay people contribute together. They could also be using publicly available consumer sensors mixed with high-end professional-grade equipment. You may also be dealing with datasets where you want to be sure you have permission to use everything within them: it could be a case where you're collecting data to train models on, and you want to make sure you have the correct licensing for all of the material, and we think annotations could be very useful for that. It could save you from getting into difficulties down the line for using things you shouldn't have been using.

At UCC we're working with models and data confidence for mixed-trust applications. Many applications and models like to have a lot of data to work with, which makes intuitive sense: with lots of material there's more scope to learn about the domain you're training on. However, not all data comes from the same place. As I said, you'll have different sensors, different firmware, old hardware, new hardware; something might have been installed by a certified engineer or by a home user. So you have a mix of sources: some of it you can trust highly, some of it maybe not so much, but a lot of data generally makes for a better model.

In the illustration here, data comes from different sources. Most of it might be low-trust-score data, the larger red circle; some could be a little more trusted, in yellow; and the green is maybe the best quality, the most trustable material we're aware of. We could train a model with just the trust-score-6 material, which is the smallest set, and say we're very confident in everything that goes into the training. However, it's the least amount of material compared to the alternatives, so you may get good results or you may get limited results, depending on what you're working with and what your application is. Generally most models do better when they have more to learn from, but if there is poor-quality or possibly malicious material in your larger set, you want to keep it from contributing too much to your model.

So we take the trust scores, which the data confidence fabric can give you by annotating the data based on its provenance, history, and security characteristics, and we use them as an additional input to the model. Most machine learning and AI models let you weight inputs, whether through per-sample weights or sampling and so on. Depending on your application, you might be more interested in one kind of annotation than another: you might care most about material that has been signed, has gone over something like secure socket layer, and comes from devices with trusted platform modules installed, or you may not worry too much about that. There's something like the difference between safety and security, where it's very important that your material is very good and very trustable,
versus something more like optimization, where a catastrophe means a small loss of money and carbon, whereas a loss in safety or security could be much more serious.

We have a case study from our work using data confidence as an input to machine learning, where we decided to mix trust: we accept material from the low-trust data set combined with material from the high-trust data set, but we use the trust weights to decide how much each contributes to the model. In the experimental setup we take an existing data set and poison a portion of it: we actually manipulate it, changing values in order to cause a malicious result in the resulting model. But because we have data confidence fabric annotations to work with, we can make sure that anything which had the potential to be altered in this way gets a commensurately low trust weighting on it. The idea is to combine the material that may have been manipulated or malicious with the material we know is relatively clean and high trust, so we have a large data set with a combination of high trust and low trust, but we limit how much the low-trust material contributes without discarding it entirely.

Our experimental setup uses the census data for California housing in 1990, and the objective is to predict the median value of housing from the other fields: latitude, longitude, ocean proximity, and so on. The poisoning is to shift the latitude two degrees northwards. The effect, if you're a malicious actor who wants to wreck a model and you apply this to material going into it, is that LA ends up somewhere in the middle of the desert and San Diego up the coast in no-man's-land, so the predictions for the median values get pretty far out of whack.

In the results, the blue line is the baseline, the best case where everything is perfect and nothing is manipulated. When we use just the clean set, the green result, then the more of the data set that may have been poisoned, the more of it we are disregarding entirely, which means the clean set gets smaller and smaller; once you get past 50%, results start to degrade because the amount of material to learn from is shrinking. On the other side we have the poisoned data set, where we incorporate the poisoned material with the clean material; this degrades pretty badly as well, because the poisoning is malicious in nature and was intended to cause a bad result. And then the trust result, the red triangles, is where we actually use trust as an additional input: we accept the material that may have been poisoned, but it's weighted lower than the clean material. So we get the benefit of all the clean material plus all the potentially poisoned material, some of which still has good explanatory power: it may have been low trust but still good quality, or there are parts of it that weren't actually manipulated and still give you good information.
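A rough reconstruction of that kind of experiment is sketched below; it is not the project's actual code. It uses scikit-learn's California Housing data (the 1990 census set), marks half the training rows as low trust, poisons them by shifting latitude two degrees north, and then compares discarding them, naively including them, and trust-weighting them. The poisoned fraction, the 0.2 weight, and the model choice are assumptions for illustration.

```python
# Illustrative sketch of a trust-weighted training experiment.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Mark half the training rows as low trust and poison them (latitude +2 degrees).
rng = np.random.default_rng(0)
low_trust = rng.random(len(X_train)) < 0.5
X_mixed = X_train.copy()
X_mixed.loc[low_trust, "Latitude"] += 2.0


def evaluate(X_fit, y_fit, sample_weight=None):
    """Train a regressor and report test error on the untouched test split."""
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X_fit, y_fit, sample_weight=sample_weight)
    return mean_absolute_error(y_test, model.predict(X_test))


# Clean-only: discard everything that might have been poisoned.
print("clean only    :", evaluate(X_mixed[~low_trust], y_train[~low_trust]))
# Naive: use everything, poisoned rows weighted the same as clean ones.
print("poisoned      :", evaluate(X_mixed, y_train))
# Trust-weighted: keep everything, but down-weight the low-trust rows.
print("trust-weighted:", evaluate(X_mixed, y_train, np.where(low_trust, 0.2, 1.0)))
```

The trust-weighted run keeps the full data set while limiting how much the potentially poisoned rows can pull the model off course, which is the same trade-off the red-triangle results illustrate.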
From our side at University College Cork, we have some future directions for using data confidence fabrics; we think it's a pretty interesting technology in a cool area. Usually in a zero-trust environment you use this kind of information to decide whether or not to use something at all, whereas we think there's something interesting in using it a little bit, or using it as much as is appropriate for the application in mind. We're also thinking about the trust scoring itself: we saw that you score something based on your choice of weightings for certain security features, but what if those weightings could be learned on a per-application or per-organization basis? There might be some way to do that through iterative modelling or feedback loops, something communicating between the edge and the cloud and back again based on the actual results of what you're doing. We also have an interest in new types of annotations; one of our researchers, Khalid, is working in this area, exploring different kinds of annotations, and maybe there are some novel approaches we can take there. And we have some ideas about using the performance of the models themselves as a way to calibrate the weights: based on how well your model is doing, does that tell you something about a particular annotation that appears quite a lot? Now we have a short smart contract demo using Alvarium and Hedera.

Yeah, so now I'm going to show a five-minute demo. For this demo I'm using Hedera and Project Alvarium. Hedera is a distributed ledger that uses the hashgraph consensus algorithm; we chose it because consensus in Hedera is much faster than in other ledgers, and because it has publish-subscribe semantics. Alvarium works with any messaging middleware, like Kafka or Pulsar, anything that supports publish-subscribe semantics. In this case we're using a distributed ledger; it doesn't have to be a ledger, but the ledger here adds some more trust.

What I just did here is start a service that created a smart contract on the Hedera ledger and created a new topic on Hedera so that the sensor can publish to it. The smart contract is used for automated billing of trust services, that is, the annotations that are injected into the data as it flows, and the topic ID is what the Alvarium SDK uses to publish the annotations it adds to the data. So those are the two pieces: the contract ID and the topic ID. Next I update the config for the other two apps to use that contract ID and topic ID, and then I run the UI, a React app that subscribes to that topic to get the annotations and to the smart contract.

Just a second. Here I'm showing the wallet of the user, which has some HBAR, the cryptocurrency used in Hedera, and the wallet of the trust provider, which is Dell in this case. A fee is registered in the smart contract and is paid by the customer based on the annotations added to the sensor data they're using, and gas fees are paid to the network for publishing to the topic. Here you'll see the source of the data, the reading, the audit information, the hash, and the annotations that are added. Now I'm running a simulated sensor that generates dummy data and sends it to the ledger. This is the dummy data, generated and annotated by the Alvarium SDK, and it's being sent over.
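As an aside, the billing mechanics can be pictured with a small, purely hypothetical mock; this is not the Hedera SDK or the demo's actual contract code, just a sketch of charging a fee per published annotation, including the bulk-settlement variant shown a little later in the demo.

```python
# Hypothetical mock of per-annotation billing; not Hedera code.
from dataclasses import dataclass


@dataclass
class BillingContract:
    """Mock contract that bills a customer per published annotation."""
    fee_per_annotation: float
    customer_balance: float = 100.0
    provider_balance: float = 0.0
    accrued: float = 0.0

    def record_annotations(self, count: int, settle_immediately: bool = True) -> None:
        """Accrue fees for published annotations; optionally settle right away."""
        self.accrued += count * self.fee_per_annotation
        if settle_immediately:
            self.settle()

    def settle(self) -> None:
        """Transfer everything accrued so far in a single transaction."""
        self.customer_balance -= self.accrued
        self.provider_balance += self.accrued
        self.accrued = 0.0


contract = BillingContract(fee_per_annotation=0.5)

# Per-data-point settlement: one transaction for each reading's annotations.
contract.record_annotations(count=3)

# Bulk settlement: accrue across several readings, then pay once.
contract.record_annotations(count=3, settle_immediately=False)
contract.record_annotations(count=3, settle_immediately=False)
contract.settle()

print(contract.provider_balance, contract.customer_balance)  # 4.5 95.5
```

On a real ledger each settlement is its own transaction with its own network fee, which is the reason given in the demo for preferring bulk payment on the publisher side.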
Then I'll switch over to the UI to see them pop up here, and you can see the billing happening: the fee is automated. If you view the annotations, you can see the scores added to the data point. As you can see, each one is a boolean for a kind of trust annotation: TPM is first, it's false, so you don't get a point for that, and so on. If we look at the wallet, we can see a fee transferred from the customer's account to Dell's account.

I set up another variant here. In the first one, billing happens automatically after every annotation, after every data point as it comes in, but I noticed the publisher ends up paying a lot of gas fees because every data point is its own transaction. So instead you can pay in bulk, which is the second small demo: as data points start coming in, nothing goes out of the publisher's wallet; after you fetch a few, the amount due accumulates up at the top; and at a certain point you click the button and the whole bill is paid at once in a single transaction. The account balance has increased by 21. And that's it. Thank you, guys.

If you want to reach out, those are our emails. If you're interested in the project, please reach out, and if you have an interesting project that you think we might be interested in as well, reach out to us. Any questions?

A question on the trust scores: could you also have them not be just zero or one, but a range in between? So, to repeat: when attributing trust scores, could you configure them to be not just one or zero but anything in between? Exactly, any floating-point number.

In such a case, how do you verify that the same factor causes the same amount of change in trust across different sources, so that if one thing causes a decrease in trust, the decrease is equivalent no matter where it's observed, for example a vulnerability in one source versus the same vulnerability coming from a different edge node? My assumption here is that the weights would be defined by the verifier only. Exactly, that's it. You have a certain number of policies, and once the data comes in you have all the metadata, all the annotations. The annotations just tell me whether there's encryption or not, what type of encryption, whether the signature was verified, and so on for each policy. Then, all at once, I weight these however I want to weigh them and calculate the score. So the scoring is not happening as the data flows through; at the points where the data is touched throughout the network, the gateway, whatever server the data passes through before reaching storage, all I'm collecting is metadata, and at the end I score it in one place.

Thanks for the talk. I believe you mentioned that you also use this technology for license compliance, and I would assume licenses for data sets. I think this is the first time I've heard about this, so I'm really interested. Could you elaborate a bit on how that looks, at least at a high level? So, we're very early on with this.
But I think the usual thing with the data confidence fabric idea is that you can make an annotation for pretty much anything you care about. So if what you're interested in is, say, a data set where you want to verify which parts have been properly licensed, there's going to be some process to verify those licenses, and if that step is complete at the time it's requested, it becomes an annotation added to the history of that piece of data. Then at a subsequent point, if I'm collecting a data set from all the material I have, I can say: just give me the material that has that annotation associated with it. Now I have a data set where I know, from its provenance, that it passed that particular check at that time. How you go about the check itself is up to the developer, I suppose, but the idea is that the data confidence fabric gives you the framework to approach things like this. I think this is very important when you're collecting things from a mix of public and private sources; the material could have come down many different paths to where you are, and you want to make sure everything is above board so you don't get yourself into a sticky situation using things you shouldn't have been using. We're also starting to see tools like Nightshade and Glaze that attempt to make that kind of material worse to work with, for very good reasons. So you can make sure you're using good-quality material that you have permission to use, and that the models you've trained on it are likewise above board and fair to use.

Just to be clear, if I understood correctly, it's not you who provides the scoring, right? So the framework gives you a way to put the annotations into a ledger, a subscription broker, or something like that. Trust scores can be generated and added to your ledger, or they can be computed at any time a decision is being made about that piece of data, and that could happen anywhere. So you can request scores that have been pre-computed, or you can write your own trust-score algorithm, request the annotations associated with all the material you're dealing with, and produce your own scores. There's flexibility there, and you can have trust scores on a per-application basis as well.

Okay, and I have a question about the smart contract in Alvarium in your blockchain use case. Why use a smart contract only to inscribe data on chain? What is the benefit of writing the scoring on chain via a smart contract when you could just verify it in a database? And a second, related question: if you're really providing verifiable, high-quality data, could you act as an oracle, feeding secured, already-annotated data into the blockchain for others to use? Are you thinking about that type of use case?

Okay, so the first question, why are we using a smart contract: as I mentioned, a ledger and a smart contract are not essential; they were just part of this use case. The main purpose of the contract was to automate billing for trust services. If anyone wants to package up this SDK and provide it as a service, and keep it distributed, they can use a ledger to automate the billing process for those annotations, charged per annotation.
So the only reason we're going on chain is just to track those annotations, and as you said, storing on chain is inefficient if there's a bulk of data going on chain. For the second question, about acting as an oracle: to be honest, I'm not quite sure how oracles work. I'm familiar with them, and they're also peer to peer, but I'm not familiar enough with how they work to say how Alvarium could contribute to being an oracle. So, yeah.

Okay, no more questions, so let's have another round of applause. Thank you.