Hi, my name is Roman. I'm a Principal Software Engineer working on network services at Profian. Today I'll tell you how to build a platform-agnostic and hardware-agnostic secure network of trusted applications on untrusted hosts.

We all love the cloud. It's convenient. It enables companies to save money, grow faster, and it eliminates the need for a ton of work managing and maintaining our own infrastructure. It simply makes our lives easier. Well, for the most part.

Unfortunately, security breaches do happen, and they're costly. According to the IBM Cost of a Data Breach 2022 report, $9.44 million is the average cost of a data breach in the US, $4.35 million is the average total cost of a data breach globally, and $10.10 million is the average total cost of a breach in the healthcare industry.

Unfortunately, or rather quite fortunately given the risks, businesses from various highly regulated sectors like finance or medicine simply cannot benefit from cloud offerings due to various laws around things like privacy and data protection. But it doesn't necessarily have to be this way. Confidential computing, by allowing protection of data in use, creates opportunities to do things which simply weren't possible before.

One way to benefit from confidential computing would be to simply use the TEEs directly. For example, we could use the SDK provided by the hardware manufacturer and, equipped with a thick stack of documentation, off we go. It works, but there are quite a few drawbacks.

First and foremost, security is hard. Writing software that communicates directly with a secure CPU is not exactly everyone's cup of tea. If all you need is a simple microservice application with a small REST API, diving deep into the internals of a particular hardware technology just should not be necessary. It takes away precious time that could otherwise be spent on developing revenue-producing business logic.

But let's say we went ahead and developed a secure layer interfacing with a particular CPU technology. Well, now we have to maintain it. Apart from that, we also have to fix any bugs, have the code reviewed, and hope that none of the bugs are exploitable. People make mistakes, and the more code there is, the more opportunities there are to make one.

After putting all of this work in, now imagine that you want to switch to a different service provider, one which does not offer the same hardware technology you used originally.
Or, much more concerning, what if a vulnerability is discovered in the particular hardware technology you developed against? Different trusted execution environments are simply not compatible with each other. So you're really left with just two choices: either wait until the vulnerability is fixed and hope your application is not exploited in the meantime, or go ahead and redo all of the work you've already done for the original technology for the new one.

Last but not least, chances are that someone has already done this before, and fundamentally the concepts that make systems secure do not change. So most likely you would just be repeating the same work someone else has already done.

At Profian, we are custodians of the Enarx open source project, which, among other things, is designed to address exactly the issues I've just outlined. It's a thin, secure layer of abstraction between the host and the TEE. It's essentially a secure runtime which lets you execute your WebAssembly workloads inside arbitrary trusted execution environments. Enarx supports various backends; today that's Intel SGX and AMD SEV-SNP, but as more and more TEEs become available, support will be added for them as well.

The Enarx project was started in 2019, and in 2021 Profian was founded, committed to being 100% open source and to providing services and support for Enarx. In 2022, we also launched our enterprise products.

So now, why WebAssembly? It's polyglot: it's supported by languages like Rust, C, C++, Go, Java, Python, C#, JavaScript, Ruby, and the list goes on and on. It's designed to be portable and embeddable. Functionally it's equivalent to a usual native binary, so for the most part the development process is exactly the same as for any other application. There is also an emerging system API standard called WASI, to which, by the way, we contribute as well.

You can run Enarx outside of a TEE for development purposes. It runs on Linux, Windows, and macOS; both x86_64 and ARM64 are supported. Trusted execution is currently only available on x86_64 Linux. For SGX, you'll need a recent kernel and a few Intel-provided services running, like aesmd and PCCS. For AMD SEV-SNP, all you really need is a recent kernel, unfortunately with a patch set provided by AMD. The patches are not mainline yet, but we also maintain our own kernel tree with everything you could possibly need for this.
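As a quick illustration of how ordinary the development workflow is, here is a minimal sketch of building a Rust workload for WASI and running it with Enarx outside of a TEE for development. The backend selection flag is an assumption and may differ between Enarx versions; the rest is standard Rust tooling.

    # Standard Rust tooling: add the WASI target and build the workload
    rustup target add wasm32-wasi
    cargo build --release --target wasm32-wasi

    # Run the resulting WebAssembly module under Enarx for development;
    # outside of a TEE a software backend can be used.
    # The exact backend flag is an assumption - check `enarx --help`.
    enarx run --backend=kvm target/wasm32-wasi/release/my-app.wasm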
Now let's see how Enarx is actually deployed. On the left here, we have a tenant; let's call her Jane. On the right, we have a CSP server with a supported CPU, on which Jane wants to deploy her workload. How does Jane ensure the integrity of the workload being executed by the CSP and the confidentiality of her data in use?

To do that, Jane asks for her workload to be executed in an Enarx Keep. The first thing that the Keep does is ask the secure CPU to measure the encrypted memory pages containing the Keep itself, that is, the execution layer and the shim. The CPU then returns a cryptographically signed attestation report containing that measurement along with information about the CPU, for example the firmware version used. The execution layer then sends the report to an attestation service for validation. In Enarx, this attestation service is called the Steward.

The Steward will make sure that the Keep is indeed trusted. It will check the signature of the report to ensure the workload is being run in a hardware-based trusted execution environment, it will also make sure, for example, that the CPU firmware version used is not vulnerable, and it will verify that the Enarx execution layer was not tampered with. On successful attestation, the Steward issues a certificate for the Keep, which is used to fetch the workload from a registry, which we call the Drawbridge in Enarx. The certificate is also used for performing cryptographic operations, for example for providing transparent TLS to the workload.

Now let's see how this works in practice. To begin with, let's see how we actually run something within an Enarx Keep. The fundamental unit of work executed by Enarx today consists of just a WebAssembly executable and an Enarx Keep configuration. For example, here is how it looks for the chat server that I'm going to secure later. This is the Keep configuration: here my Steward is configured, the personal Steward that I've deployed on a VPS, along with my stdio configuration. In this case I want to inherit everything from the host, meaning that I print to the host's standard output and also get standard input from the host. This file can also contain things like network policy, trust anchors, and other things like that.
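To make that more concrete, here is a rough sketch of what such a Keep configuration might look like. The field names and layout are assumptions for illustration and may not match the current Enarx.toml schema exactly; the Steward URL and port are placeholders.

    # Illustrative Enarx.toml sketch; field names and values are assumptions
    steward = "https://steward.example.com"

    # Inherit standard I/O from the host
    [[files]]
    kind = "stdin"

    [[files]]
    kind = "stdout"

    [[files]]
    kind = "stderr"

    # Example network policy: listen on a port, with TLS provided by the runtime
    [[files]]
    kind = "listen"
    name = "listen"
    prot = "tls"
    port = 23456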
I've already uploaded this package to my personal Drawbridge and tagged it 0.1.0. Let's see what that looks like. For that, I'll make a request to my Drawbridge, and what I get back for this request is a tag, which we also call an entry. An entry is nothing else than a node inside a Merkle tree, and it's a Merkle tree because it contains the digest of its own contents.

What that means is that if I, for example, go one layer deeper and inspect the actual tree associated with this tag, I'll see that it contains the Enarx.toml and the main.wasm we've seen earlier. Now if I were to compute the digest of my Enarx.toml, you'll see that it is exactly the same digest we see here and here.

I can, of course, go one step up and, instead of computing the digest of the Enarx.toml, compute the digest of the actual entry, the actual tag. For that, I'll make a request again to the same URL and again compute the digest of it. If you remember, you'll notice that this is again exactly the same digest we see in our tag. This digest is, in fact, a digest of the minified JSON of the object we've seen over here. It's nicely formatted for us by jq, but if we request it directly, we just get the minified JSON, which we then hash.
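A condensed sketch of those digest checks is shown below. The Drawbridge host, the route shape, and the hash tool are assumptions for illustration; the point is only that hashing the raw, minified entry JSON and the individual files reproduces the digests recorded in the tree.

    # Hypothetical Drawbridge URL; the route shape is an assumption
    TAG_URL="https://drawbridge.example.com/example-user/chat-server/_tag/0.1.0"

    # Fetch the tag (entry) and pretty-print it with jq
    curl -s "$TAG_URL" | jq .

    # Hash the raw, minified JSON of the entry; this should reproduce the
    # digest recorded for the tag itself
    curl -s "$TAG_URL" | sha256sum

    # Hash a file from the tree, e.g. Enarx.toml, and compare it with the
    # digest stored in the corresponding tree entry
    # (substitute whichever hash algorithm the digests actually use)
    sha256sum Enarx.toml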
Now, here I'm logged in to an AMD SEV-SNP capable machine. I can, for example, read the CPU info, grep for the model name keeping only one entry, and see that this is indeed an AMD EPYC 7513 processor. I'm going to use enarx deploy, and I'll also specify the backend explicitly, to deploy the workload we just looked at. I'm going to use my custom Drawbridge again, and I'll deploy the chat server version 1.0.1, exactly the same one that we have seen before.

Then I'm going to switch to yet another server, again remote. This one has support for SGX, and here we see it's an Intel Xeon 6338. Here I'll also run enarx deploy, and in this case I will execute the chat client. Once it starts, it asks me for a URL, and I'll enter the address and the port.

You can see here that I've connected, and here you can see the server also acknowledged the connection. If you look here, you'll see the exact same digest we've just seen in our entry; it was over here. We also see the slug of the server workload we just ran on that other server, and the version. All of this information came from the certificate. It's cryptographically signed data contained within the certificate, which Enarx parses for us and exposes to the workload. Similarly, the server has also received the slug that the client was deployed from, and it has received the digest of the workload. So by looking at the certificate, we now know exactly what workload the other party is running.

We can also try to inspect this ourselves: we can use OpenSSL to connect, and sure enough we see our certificate. You can see here that the information is currently encoded in the common name; it should be a SAN, of course, but this is just a proof of concept. You can see the certificate chain that we have: a certificate with a common name associated with the slug and the digest, issued to us by the Steward, the Steward that I have deployed in my infrastructure. And there's also my own CA in the root of the chain, which signed the Steward certificate earlier.

If we look at the server logs, we'll notice the OpenSSL connection, which was actually not let in by the server. It says here that the client did not present a valid certificate. This was not a Keep with a valid certificate issued by the Steward, therefore the server didn't trust it and didn't let it into the secure chat room. Similarly, if I were to use Enarx with a different backend than SGX, for example the KVM backend, which is not a real TEE, it would not even attest to the Steward. The Steward wouldn't issue a certificate for us, and then we cannot actually run this workload in Enarx.

Now let's look at how we actually achieved this. To begin with, let's look at the client. You'll notice it's actually quite a small executable. Notice also that this workload doesn't need to do any TLS itself or anything like that; the Enarx runtime handles all the TLS connections for it, and by default all connections are TLS anyway.

We use a virtual file system to connect to an address at runtime. Unfortunately, this is required right now due to limitations of the WASI spec. There is work going on to provide these APIs, but currently it's not possible to just call a connect syscall like you normally would, which is why Enarx provides a virtual file system to connect to a particular address. Similarly, there's another virtual file system to extract the peer data from the connection we have established, and in this case we can simply match on that peer information.

Here, for example, if we are presented with an anonymous peer, that is, one which did not present a TLS certificate, we simply abort. This would also be triggered if the certificate was not signed by a party we trust, like a Steward we trust. If it was a local workload, but it was executed in a real TEE, we could still trust it, because we know the expected digest of the package we have uploaded to the Drawbridge. That is, by the way, the exact same digest we have seen before; you can see it over here.

Now in the happy flow, of course, we're presented with an actual Enarx Keep, which is associated with a slug and a digest. And what we can do here is match on the actual workload slug: where did this workload actually come from, and what is its version? In this case we don't even need to check the digest, because we trust the Drawbridge slug. So here we have verified these three versions, and we do not want to allow any other versions. Of course, this would probably eventually become Keep configuration; it could be specified in the TOML, but for now, just for simplicity, I've included everything in the source code.
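The client-side check just described boils down to logic along these lines. This is an illustrative sketch only: the Peer type, the expected digest, and the allowed slugs are placeholders, and in the real client the peer information comes from the virtual file system that Enarx exposes rather than from hard-coded values.

    // Illustrative sketch of the client's trust decision. The Peer type and the
    // constants below are assumptions for demonstration, not the Enarx API.

    /// Peer identity as presented over the attested TLS connection.
    #[derive(Debug)]
    enum Peer {
        /// No certificate (or one not signed by a Steward we trust) was presented.
        Anonymous,
        /// A locally built workload, identified only by its digest.
        Local { digest: String },
        /// A Keep running a workload fetched from a Drawbridge slug.
        Drawbridge { slug: String },
    }

    /// Decide whether the chat client should keep talking to this peer.
    fn trust(peer: &Peer) -> bool {
        // Placeholder values; in the demo these identify the uploaded package.
        const EXPECTED_LOCAL_DIGEST: &str = "sha-256:<digest of the uploaded package>";
        const ALLOWED_SLUGS: &[&str] = &[
            "example-user/chat-server:1.0.1",
            "example-user/chat-server:1.0.0",
            "example-user/chat-server:0.1.0",
        ];

        match peer {
            // Anonymous peers are never trusted: abort the connection.
            Peer::Anonymous => false,
            // A local workload running in a real TEE is trusted only if its
            // digest matches the package we uploaded to the Drawbridge ourselves.
            Peer::Local { digest } => digest == EXPECTED_LOCAL_DIGEST,
            // A Keep attested via the Steward is trusted if its slug is on the
            // allow list; the digest does not need a separate check.
            Peer::Drawbridge { slug } => ALLOWED_SLUGS.contains(&slug.as_str()),
        }
    }

    fn main() {
        let peers = [
            Peer::Anonymous,
            Peer::Local { digest: "sha-256:something-else".into() },
            Peer::Drawbridge { slug: "example-user/chat-server:1.0.1".into() },
        ];
        for peer in &peers {
            println!("{peer:?} -> trusted: {}", trust(peer));
        }
    }

Running the main function prints the decision for each sample peer, mirroring the abort-or-continue behaviour of the demo client.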
Similarly, we have the server part. It has a very similar peer check over here, where it again checks for anonymous and local peers. It actually doesn't want any local workloads in at all, and it only allows, essentially, official releases that were verified and published by this entity over here.

So let's get back to the slides. If you're interested in this project, you can get involved using one of the links provided over here.

And now, a moment for a sad announcement. Just a few hours before recording this video, I found out that Profian is closing, and therefore the Enarx project is looking for maintainers, and I'm looking for a job. So if you know anyone who would be interested in the Enarx project, or in me, please let me know. You can contact me via email or LinkedIn, and here's my GitHub handle.

And now it's time for questions. Thank you.