Hi, my name is Roman. I'm a Principal Software Engineer working on network services at Profian. Today I'll tell you how to build a platform-agnostic and hardware-agnostic secure network of trusted applications on untrusted hosts.

We all love the cloud. It's convenient. It enables companies to save money, grow faster, and it eliminates the need for a ton of work managing and maintaining our own infrastructure. It simply makes our lives easier. Well, for the most part.

Unfortunately, security breaches do happen, and they're costly. According to the IBM Cost of a Data Breach 2022 report, $9.44 million is the average cost of a data breach in the US, $4.35 million is the average total cost of a data breach globally, and $10.10 million is the average total cost of a breach in the healthcare industry.

Unfortunately, or rather quite fortunately given the risks, businesses from various highly regulated sectors like finance or medicine simply cannot benefit from cloud offerings due to various laws around things like privacy and data protection. But it doesn't necessarily have to be this way. Confidential computing, by allowing protection of data in use, creates opportunities to do things which simply weren't possible before.

One way to benefit from confidential computing would be to simply use the TEEs directly. For example, we could use the SDK provided by the hardware manufacturer and, equipped with a thick stack of documentation, off we go. It works, but there are quite a few drawbacks.

First and foremost, security is hard. Writing software that communicates directly with a secure CPU is not exactly everyone's cup of tea. If all you need is a simple microservice application with a small REST API, diving deep into the internals of a particular hardware technology just should not be necessary. It takes away precious time that could otherwise be spent on developing revenue-producing business logic.

But let's say we went ahead and developed a secure layer interfacing with a particular CPU technology. Well, now we have to maintain it. Apart from that, we also have to fix any bugs, have the code reviewed, and hope that none of the bugs are exploitable. People make mistakes, and the more code there is, the more opportunities there are to make one.

After putting all of this work in, now imagine that you want to switch to a different service provider, one which does not offer the same hardware technology you used originally.
Or, much more concerning, what if a vulnerability is discovered in the particular hardware technology you developed against? Different trusted execution environments are simply not compatible with each other. So you're really left with just two choices: either wait until the vulnerability is fixed and hope your application is not exploited in the meantime, or go ahead and redo all of the work you've already done for the original technology for the new one.

Last but not least, chances are that someone has already done this before, and fundamentally the concepts that make systems secure do not change. So most likely you would just be repeating the same work someone else has already done.

At Profian, we are custodians of the Enarx open source project, which, among other things, is designed to address exactly the issues I've just outlined. It's a thin, secure layer of abstraction between the host and the TEE. It's essentially a secure runtime which lets you execute your WebAssembly workloads inside arbitrary trusted execution environments. Enarx supports various backends; today that's Intel SGX and AMD SEV-SNP, but as more and more TEEs become available, support will be added for them as well.

The Enarx project was started in 2019, and in 2021 Profian was founded, committed to being 100% open source and to providing services and support for Enarx. In 2022, we also launched our enterprise products.

So now, why WebAssembly? It's polyglot: it's supported by languages like Rust, C, C++, Go, Java, Python, C#, JavaScript, Ruby, and the list goes on and on. It's designed to be portable and embeddable. Functionally it's equivalent to a usual native binary, so for the most part the development process is exactly the same as for any other application. There is also an emerging system API standard called WASI, to which, by the way, we contribute as well.

You can run Enarx outside of a TEE for development purposes. It runs on Linux, Windows, and macOS; both x86_64 and ARM64 are supported. Trusted execution is currently only available on x86_64 Linux. For SGX, you'll need a recent kernel and a few Intel-provided services running, like aesmd and PCCS. For AMD SEV-SNP, all you really need is a recent kernel, unfortunately with a patch set provided by AMD. The patches are not mainline yet, but we also maintain our own kernel tree with everything you could possibly need for this.
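As a quick illustration of how ordinary the development workflow is, here is a minimal sketch of building a Rust workload for WASI and running it with Enarx outside of a TEE for development. The backend selection flag is an assumption and may differ between Enarx versions; the rest is standard Rust tooling.

    # Standard Rust tooling: add the WASI target and build the workload
    rustup target add wasm32-wasi
    cargo build --release --target wasm32-wasi

    # Run the resulting WebAssembly module under Enarx for development;
    # outside of a TEE a software backend can be used.
    # The exact backend flag is an assumption - check `enarx --help`.
    enarx run --backend=kvm target/wasm32-wasi/release/my-app.wasm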
Now let's see how Enarx is actually deployed. On the left here, we have a tenant; let's call her Jane. On the right, we have a CSP server with a supported CPU, on which Jane wants to deploy her workload. How does Jane ensure the integrity of the workload being executed by the CSP and the confidentiality of her data in use?

To do that, Jane asks for her workload to be executed in an Enarx Keep. The first thing that the Keep does is ask the secure CPU to measure the encrypted memory pages containing the Keep itself, that is, the execution layer and the shim. The CPU then returns a cryptographically signed attestation report containing that measurement along with information about the CPU, for example the firmware version used. The execution layer then sends the report to an attestation service for validation. In Enarx, this attestation service is called the Steward.

The Steward will make sure that the Keep is indeed trusted. It will check the signature of the report to ensure the workload is being run in a hardware-based trusted execution environment, it will also make sure, for example, that the CPU firmware version used is not vulnerable, and it will verify that the Enarx execution layer was not tampered with. On successful attestation, the Steward issues a certificate for the Keep, which is used to fetch the workload from a registry, which we call the Drawbridge in Enarx. The certificate is also used for performing cryptographic operations, for example for providing transparent TLS to the workload.

Now let's see how this works in practice. To begin with, let's see how we actually run something within an Enarx Keep. The fundamental unit of work executed by Enarx today consists of just a WebAssembly executable and an Enarx Keep configuration. For example, here is how it looks for the chat server that I'm going to secure later. This is the Keep configuration: here my Steward is configured, the personal Steward that I've deployed on a VPS, along with my stdio configuration. In this case I want to inherit everything from the host, meaning that I print to the host's standard output and also get standard input from the host. This file can also contain things like network policy, trust anchors, and other things like that.
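To make that more concrete, here is a rough sketch of what such a Keep configuration might look like. The field names and layout are assumptions for illustration and may not match the current Enarx.toml schema exactly; the Steward URL and port are placeholders.

    # Illustrative Enarx.toml sketch; field names and values are assumptions
    steward = "https://steward.example.com"

    # Inherit standard I/O from the host
    [[files]]
    kind = "stdin"

    [[files]]
    kind = "stdout"

    [[files]]
    kind = "stderr"

    # Example network policy: listen on a port, with TLS provided by the runtime
    [[files]]
    kind = "listen"
    name = "listen"
    prot = "tls"
    port = 23456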
I've already uploaded this package to my personal Drawbridge and tagged it 0.1.0. Let's see what that looks like. For that, I'll make a request to my Drawbridge, and what I get back for this request is a tag, which we also call an entry. An entry is nothing else than a node inside a Merkle tree, and it's a Merkle tree because it contains the digest of its own contents.

What that means is that if I, for example, go one layer deeper and inspect the actual tree associated with this tag, I'll see that it contains the Enarx.toml and the main.wasm we've seen earlier. Now if I were to compute the digest of my Enarx.toml, you'll see that it is exactly the same digest we see here and here.

I can, of course, go one step up and, instead of computing the digest of the Enarx.toml, compute the digest of the actual entry, the actual tag. For that, I'll make a request again to the same URL and again compute the digest of it. If you remember, you'll notice that this is again exactly the same digest we see in our tag. This digest is, in fact, a digest of the minified JSON of the object we've seen over here. It's nicely formatted for us by jq, but if we request it directly, we just get the minified JSON, which we then hash.
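A condensed sketch of those digest checks is shown below. The Drawbridge host, the route shape, and the hash tool are assumptions for illustration; the point is only that hashing the raw, minified entry JSON and the individual files reproduces the digests recorded in the tree.

    # Hypothetical Drawbridge URL; the route shape is an assumption
    TAG_URL="https://drawbridge.example.com/example-user/chat-server/_tag/0.1.0"

    # Fetch the tag (entry) and pretty-print it with jq
    curl -s "$TAG_URL" | jq .

    # Hash the raw, minified JSON of the entry; this should reproduce the
    # digest recorded for the tag itself
    curl -s "$TAG_URL" | sha256sum

    # Hash a file from the tree, e.g. Enarx.toml, and compare it with the
    # digest stored in the corresponding tree entry
    # (substitute whichever hash algorithm the digests actually use)
    sha256sum Enarx.toml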
Now, here I'm logged in to an AMD SEV-SNP capable machine. I can, for example, read the CPU info, grep for the model name keeping only one entry, and see that this is indeed an AMD EPYC 7513 processor. I'm going to use enarx deploy, and I'll also specify the backend explicitly, to deploy the workload we just looked at. I'm going to use my custom Drawbridge again, and I'll deploy the chat server version 1.0.1, exactly the same one that we have seen before.

Then I'm going to switch to yet another server, again remote. This one has support for SGX, and here we see it's an Intel Xeon 6338. Here I'll also run enarx deploy, and in this case I will execute the chat client. Once it starts, it asks me for a URL, and I'll enter the address and the port.

You can see here that I've connected, and here you can see the server also acknowledged the connection. If you look here, you'll see the exact same digest we've just seen in our entry; it was over here. We also see the slug of the server workload we just ran on that other server, and the version. All of this information came from the certificate. It's cryptographically signed data contained within the certificate, which Enarx parses for us and exposes to the workload. Similarly, the server has also received the slug that the client was deployed from, and it has received the digest of the workload. So by looking at the certificate, we now know exactly what workload the other party is running.

We can also try to inspect this ourselves: we can use OpenSSL to connect, and sure enough we see our certificate. You can see here that the information is currently encoded in the common name; it should be a SAN, of course, but this is just a proof of concept. You can see the certificate chain that we have: a certificate with a common name associated with the slug and the digest, issued to us by the Steward, the Steward that I have deployed in my infrastructure. And there's also my own CA in the root of the chain, which signed the Steward certificate earlier.

If we look at the server logs, we'll notice the OpenSSL connection, which was actually not let in by the server. It says here that the client did not present a valid certificate. This was not a Keep with a valid certificate issued by the Steward, therefore the server didn't trust it and didn't let it into the secure chat room. Similarly, if I were to use Enarx with a different backend than SGX, for example the KVM backend, which is not a real TEE, it would not even attest to the Steward. The Steward wouldn't issue a certificate for us, and then we cannot actually run this workload in Enarx.

Now let's look at how we actually achieved this. To begin with, let's look at the client. You'll notice it's actually quite a small executable. Notice also that this workload doesn't need to do any TLS itself or anything like that; the Enarx runtime handles all the TLS connections for it, and by default all connections are TLS anyway.

We use a virtual file system to connect to an address at runtime. Unfortunately, this is required right now due to limitations of the WASI spec. There is work going on to provide these APIs, but currently it's not possible to just call a connect syscall like you normally would, which is why Enarx provides a virtual file system to connect to a particular address. Similarly, there's another virtual file system to extract the peer data from the connection we have established, and in this case we can simply match on that peer information.

Here, for example, if we are presented with an anonymous peer, that is, one which did not present a TLS certificate, we simply abort. This would also be triggered if the certificate was not signed by a party we trust, like a Steward we trust. If it was a local workload, but it was executed in a real TEE, we could still trust it, because we know the expected digest of the package we have uploaded to the Drawbridge. That is, by the way, the exact same digest we have seen before; you can see it over here.

Now in the happy flow, of course, we're presented with an actual Enarx Keep, which is associated with a slug and a digest. And what we can do here is match on the actual workload slug: where did this workload actually come from, and what is its version? In this case we don't even need to check the digest, because we trust the Drawbridge slug. So here we have verified these three versions, and we do not want to allow any other versions. Of course, this would probably eventually become Keep configuration; it could be specified in the TOML, but for now, just for simplicity, I've included everything in the source code.
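The client-side check just described boils down to logic along these lines. This is an illustrative sketch only: the Peer type, the expected digest, and the allowed slugs are placeholders, and in the real client the peer information comes from the virtual file system that Enarx exposes rather than from hard-coded values.

    // Illustrative sketch of the client's trust decision. The Peer type and the
    // constants below are assumptions for demonstration, not the Enarx API.

    /// Peer identity as presented over the attested TLS connection.
    #[derive(Debug)]
    enum Peer {
        /// No certificate (or one not signed by a Steward we trust) was presented.
        Anonymous,
        /// A locally built workload, identified only by its digest.
        Local { digest: String },
        /// A Keep running a workload fetched from a Drawbridge slug.
        Drawbridge { slug: String },
    }

    /// Decide whether the chat client should keep talking to this peer.
    fn trust(peer: &Peer) -> bool {
        // Placeholder values; in the demo these identify the uploaded package.
        const EXPECTED_LOCAL_DIGEST: &str = "sha-256:<digest of the uploaded package>";
        const ALLOWED_SLUGS: &[&str] = &[
            "example-user/chat-server:1.0.1",
            "example-user/chat-server:1.0.0",
            "example-user/chat-server:0.1.0",
        ];

        match peer {
            // Anonymous peers are never trusted: abort the connection.
            Peer::Anonymous => false,
            // A local workload running in a real TEE is trusted only if its
            // digest matches the package we uploaded to the Drawbridge ourselves.
            Peer::Local { digest } => digest == EXPECTED_LOCAL_DIGEST,
            // A Keep attested via the Steward is trusted if its slug is on the
            // allow list; the digest does not need a separate check.
            Peer::Drawbridge { slug } => ALLOWED_SLUGS.contains(&slug.as_str()),
        }
    }

    fn main() {
        let peers = [
            Peer::Anonymous,
            Peer::Local { digest: "sha-256:something-else".into() },
            Peer::Drawbridge { slug: "example-user/chat-server:1.0.1".into() },
        ];
        for peer in &peers {
            println!("{peer:?} -> trusted: {}", trust(peer));
        }
    }

Running the main function prints the decision for each sample peer, mirroring the abort-or-continue behaviour of the demo client.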
Similarly, we have the server part. It has a very similar peer check over here, where it again checks for anonymous and local peers. It actually doesn't want any local workloads in at all, and it only allows, essentially, official releases that were verified and published by this entity over here.

So let's get back to the slides. If you're interested in this project, you can get involved using one of the links provided over here.

And now, a moment for a sad announcement. Just a few hours before recording this video, I found out that Profian is closing, and therefore the Enarx project is looking for maintainers, and I'm looking for a job. So if you know anyone who would be interested in the Enarx project, or in me, please let me know. You can contact me via email or LinkedIn, and here's my GitHub handle.

And now it's time for questions. Thank you.