Hello? Okay, now it works. Kind of, right? Okay. So hello everyone, welcome to the Security devroom. Our next talk is about Keylime and remote attestation, and it will be given by Anderson and Thore.

Okay, so welcome, and sorry about the trouble. I'm Anderson, I'm a software engineer at Red Hat, and I'm here with Thore. Yeah, I'm Thore, I'm a maintainer of a Linux distribution for schools and universities, and I'm also a maintainer of Keylime. So we are here to talk about remote attestation with Keylime. Let's get started.

Imagine you are a car vendor who maintains and updates the systems running in cars, but you want to make sure that the systems in the cars were not modified, so that you can check whether the customer is still eligible to receive the latest updates, or something like that. Or you are a software company building software in the cloud, and you want to make sure that the build tooling was not modified. Or you are a telecom company that wants to make sure that the systems you deployed to control antennas were not modified. What all these cases have in common is, first, they are remote, and second, you don't really have full control of the systems out in the field. So the question is: how can you check that a system out there was not modified? Our approach would be: if you could somehow get some information about the system, you could check whether it matches what you expect, and of course, in case it doesn't, you would want a way to react to that. If you can do that continuously — get the information, check it — then you have monitoring of the integrity of the system. That is one of the things remote attestation can provide: checking the integrity of a remote machine.

How does it work? You have a trusted entity running in some controlled environment, and a trusted agent on the other side, running on the monitored system. You ask that agent for information and get back a piece of data called a quote, and with it you can verify that the agent is running on a machine in a state that you trust. That brings up the problem of trust: how can you trust the machine, or an agent running on a machine that you don't control? You don't directly trust the agent; you trust a hardware root of trust, which is the Trusted Platform Module, or TPM.

What are TPMs? They are pieces of hardware that can perform cryptographic operations such as generating keys and signing data, and each one has a special key and certificate, the endorsement key (EK), which are generated during manufacturing. The manufacturer generates the key and publishes the CA certificate, so that you can verify the EK certificate is legitimate.
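To make that last verification step a bit more concrete, here is a minimal sketch — not Keylime's implementation — of checking an EK certificate against a vendor CA using Python's cryptography package. The file names are made up, and it assumes PEM-encoded, RSA-signed certificates and a single CA file; real EK certificates are often DER-encoded, may be ECDSA-signed, and usually chain through intermediate CAs.

```python
# Minimal sketch: check that an EK certificate was issued by the TPM
# vendor's CA. Assumes PEM files and an RSA-signed certificate; real EK
# chains often have intermediates and may use ECDSA, so treat this as
# illustrative only.
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import padding

with open("ek_cert.pem", "rb") as f:       # hypothetical file name
    ek_cert = x509.load_pem_x509_certificate(f.read())
with open("vendor_ca.pem", "rb") as f:     # hypothetical file name
    ca_cert = x509.load_pem_x509_certificate(f.read())

# Raises InvalidSignature if the EK certificate was not signed by this CA.
ca_cert.public_key().verify(
    ek_cert.signature,
    ek_cert.tbs_certificate_bytes,
    padding.PKCS1v15(),
    ek_cert.signature_hash_algorithm,
)
print("EK certificate was signed by the vendor CA:", ek_cert.subject)
```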
Now, this EK, the endorsement key, can't sign data directly, but you can generate attestation keys (AKs) that are associated with the endorsement key in a way that lets you verify the origin of signed data, so you can make sure a given piece of data was signed by that specific TPM. Another important thing the TPM has are the platform configuration registers, or PCRs: special registers designed to store measurements about the system, in a way that lets you verify its integrity.

How are these measurements made? During boot, each step of the boot process is measured by the UEFI firmware into the TPM via the PCR extend operation: for each step the boot process goes through, you take a hash of the binary or software that is about to run and extend it into a PCR — I will explain the extend operation in a moment. So during boot, the UEFI firmware is responsible for measuring the boot steps into the TPM, and after boot, the kernel's Integrity Measurement Architecture, or IMA, measures any opened file that matches a policy you configure, again into a PCR. If you have the state of the PCRs and the event log — the list of all the extend operations that were performed — then you can verify the integrity of the machine.

The PCR extend algorithm itself is quite simple. You take the old value stored in the PCR and concatenate it with the measurement of the new data — the measurement is basically a hash — then compute the hash of that concatenation and put the result back into the PCR. That is done for each step. Of course, if you know a bit about TPMs, the PCR numbers shown here don't match the real ones; this is just for illustration. After measuring all these steps, you have the final values in the PCRs, and from them you can compute what is sometimes called a golden value — for example a hash over all the PCR values — which gives you a representation of the state of the machine that can be verified.

So how does Keylime work? On the left side you have the trusted entity, probably a machine that you control, where you run the verifier side of Keylime; it's a server. On the right side you have the monitored system: it's remote, you don't have complete control of it, but the agent there has access to the TPM installed in that machine. The verifier can request the state from the agent; the agent accesses the TPM to get a quote — meaning the PCR values — together with the event logs of all the PCR extend operations that were performed, and sends that back to the verifier. The verifier can then verify, first, the origin of that piece of data, because it is signed by the AK.
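As an aside, the extend-and-replay mechanics described above fit in a few lines of code. This is a minimal illustrative sketch, not Keylime's implementation: it simulates the SHA-256 extend operation and replays a made-up event log to recompute the PCR values a verifier would expect to see in the quote, which is exactly the replay step described next.

```python
# Minimal sketch of the PCR extend operation and event-log replay.
# Illustrative only: real event logs (UEFI TCG log, IMA log) have much
# richer formats than the simple (pcr_index, data) pairs used here.
import hashlib

PCR_SIZE = 32  # SHA-256 bank

def extend(pcr_value: bytes, measurement: bytes) -> bytes:
    """new_pcr = SHA256(old_pcr || SHA256(measured data))"""
    digest = hashlib.sha256(measurement).digest()
    return hashlib.sha256(pcr_value + digest).digest()

def replay(event_log):
    """Replay the extend operations to compute the expected PCR values."""
    pcrs = {}
    for pcr_index, data in event_log:
        old = pcrs.get(pcr_index, b"\x00" * PCR_SIZE)
        pcrs[pcr_index] = extend(old, data)
    return pcrs

# Hypothetical event log: (PCR index, measured data)
event_log = [
    (0, b"firmware"),
    (4, b"bootloader"),
    (4, b"kernel"),
]

expected = replay(event_log)
for idx, value in sorted(expected.items()):
    print(f"PCR {idx}: {value.hex()}")
# A verifier would compare these replayed values against the PCR values
# reported in the AK-signed quote.
```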
Because the quote is signed by the AK, you can make sure the data came from the machine that contains that TPM, and you can verify the identity of the TPM itself using the EK certificate. With the PCR values you obtained and the event log, you can replay all the extend operations, so that in the end you get the values the PCRs should have. With all this information, you can verify the integrity of the machine. You also get the information from IMA: IMA calculates the hash of every opened file that matches some policy and extends it into a PCR, so you get a log containing the file names and the matching hashes. With some policy engine, you can therefore also verify the integrity of individual files on the remote machine. So you get a fairly complete integrity view of the remote machine.

With that information the verifier can check: if everything matches, the attestation was successful; if it doesn't match what you expected, it's a failure. For the failure case we have a revocation framework. You can configure actions on the verifier — scripts it runs to perform some action, or webhooks, so that if an attestation fails it sends a request to some webhook — or you can notify the agents directly via their REST API and send them a payload to trigger some operation there. The simplest scenario: if you have a cluster with several machines and one of them fails attestation, you can notify the others to remove that node from the cluster, for example by blocking its network connectivity. So that's how Keylime works in general. Now I'm passing the mic to Thore; he will continue with the real-world stuff.

Yeah. So now we have heard how Keylime works, and we want to show that you can use it in production, and what challenges you will run into if you try. There are three main parts: first policy creation, then the monitoring, and then how to react. The first part is that we want to create policies for our systems. For that, we need to know what is actually on our systems, and what our systems are. From the software side, it's normal that we have a CI/CD pipeline, we know what goes into it, and we want to save the hashes there. But we also need a lot of other information: which packages are installed, where their files end up on the system, whether they have signatures and whether we can verify them. That is what we normally want to have, and either this information is already provided by the distribution, or we need to generate it on our own.
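As a rough illustration of what "generate it on our own" can look like — this is not Keylime's runtime-policy format or its policy tooling, just a hypothetical script — the raw material is essentially a map from file paths to hashes, like this:

```python
# Hypothetical sketch: build a file -> sha256 map for part of the
# filesystem, as raw material for a runtime (IMA) policy. Keylime ships
# its own policy tools and JSON format; this only illustrates the idea.
import hashlib
import json
import os
import sys

def hash_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_allowlist(root: str) -> dict:
    allowlist = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                allowlist[path] = hash_file(path)
    return allowlist

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "/usr/bin"
    print(json.dumps(build_allowlist(root), indent=2))
```

In practice, of course, you would rather take these hashes and signatures from the build pipeline or the distribution's package metadata than from a live filesystem, which is exactly the point made above.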
Then, on the hardware side, we need to know what hardware we are running on. As we said, we have the EK, the endorsement key; we need to know at least that to trust the TPM in some regard. Ideally, we also want to know what firmware is running on the device and which configuration it has — for example, do we allow Secure Boot to be disabled or enabled, do we have our own keys enrolled there, and so on.

If you have all that information, we can go to the other part, which is the monitoring. That part is implemented by Keylime: if you have all the necessary information, we provide documentation and tools to generate a policy, you feed it in, and it's all there. The challenge you run into here is that for many of you, IMA, measured boot, and TPMs are probably new, and if you run into issues, you also need to understand how those technologies work in order to debug them. So that is a challenge: you still need a good understanding of those technologies to make your life easier. But yeah, that part is mostly solved by Keylime.

And then we come to the non-technical side, which is that we need to react somehow when we have an attestation failure. Is that alert actually relevant for us? If we have file changes in /tmp, we don't really care. Then, who needs to be notified when it happens? How do we tie it into our existing monitoring infrastructure, for example with the webhooks? And lastly, if you are a company, a Keylime failure — in the way you configured it — is a potential security breach, so there are service agreements in place that determine whom you notify and how you respond.

Going now from the general part to actual examples: I work on the Linux distribution used for exams in schools and universities, called Lernstick. Together with the University of Applied Sciences and Arts Northwestern Switzerland, FHNW, we developed a system called Campla for secure bring-your-own-device exams. What is the problem here? The students want to bring their own device, their own notebook, into the lecture hall and write their exams on it. We don't want to touch their operating system, so we do something we call "bring your own hardware": they bring their own hardware, we boot a live Linux system on it, and we remotely attest that this system is running correctly. So what do we have? We have the hardware, which has a hard drive and a TPM. We boot the distribution, Lernstick, and on it we have the Keylime agent running, plus IMA and our measured-boot setup. And now the interesting part: we only care about the TPM. We don't care about the hard drive or whatever else is on that machine.
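If you want to poke at this yourself on such a live system, reasonably recent Linux kernels expose the PCR banks of a TPM 2.0 through sysfs, so you can at least eyeball the values a quote would cover. Here is a minimal sketch, assuming the TPM shows up as /sys/class/tpm/tpm0, the kernel provides per-bank PCR files, and you have permission to read them; older kernels and TPM 1.2 devices will not have this interface.

```python
# Minimal sketch: read SHA-256 PCR values from sysfs on a Linux system
# with a TPM 2.0. Assumes a kernel new enough to expose per-bank PCR
# files (pcr-sha256/<index>) and sufficient permissions to read them.
import os

PCR_DIR = "/sys/class/tpm/tpm0/pcr-sha256"

def read_pcrs(pcr_dir: str = PCR_DIR) -> dict:
    pcrs = {}
    for name in os.listdir(pcr_dir):
        with open(os.path.join(pcr_dir, name)) as f:
            pcrs[int(name)] = f.read().strip()
    return pcrs

if __name__ == "__main__":
    if not os.path.isdir(PCR_DIR):
        raise SystemExit("No SHA-256 PCR bank exposed in sysfs on this system")
    for index, value in sorted(read_pcrs().items()):
        print(f"PCR {index:2d}: {value}")
```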
Then comes the actual server side. The device registers with the exam system, and that also includes registering with Keylime. In return, we check whether the system is actually in a trustworthy state, and if that's the case, we release the exam files — in our case normally an RDP session, which then connects to the cloud where the students actually write their exams. Why are we doing it this way? First, we guarantee that the environment is the same for every student, because they only provide the hardware and it's basically a terminal connecting to the actual exam; if there is compute-intensive work, it doesn't really matter. And because they only bring their own hardware and don't need to install monitoring software on their own system to write the exam, we don't care — and don't want to know — what they do on it otherwise. That's first for privacy, and it also makes the setup much easier.

Now back to a more traditional scenario that more of you are probably familiar with: the cloud. There we have the example of IBM, which uses this for hypervisor attestation. They don't use runtime attestation — or not anymore — they use measured boot to check whether the hypervisor booted up correctly. Their challenges were, first, implementing the actual response procedures: the path from "we have an alert" to "how do we deal with it now". That is the difficult part, because one side is technical, but the other is how to structure your teams so that you can guarantee a response. The next one is eliminating false positives, which ties into the first point: if a human has to react, we want no false positives, and ideally no false negatives either — false negatives are very, very bad for security, so we really don't want those. And lastly, keeping the policies up to date: even if you roll your own distribution and are big enough, it's very difficult to stay current on those policies and integrate updates automatically. Just for illustration, they have an escalation chain: they use Keylime to monitor, tie it into their Jira system, and then have an actual person react at the other end.

And then one data point from a distribution, in this case SUSE. I asked them, and they have integrated Keylime into pretty much every product: it's in openSUSE today, there are instructions for using it on MicroOS, and it's in SUSE Linux Enterprise and ALP. Their challenges were things like integrating it fully with SELinux and making IMA usable: do we have signatures, how do we provide the hashes? And the general question for a distribution is how to provide robust policies, because we want users to try out the technology and experiment with it — but how do we give them a starting point?
That is still very difficult, because as we saw, there are many data points that need to be collected, and that is a challenge they are actively trying to solve by making it easier to obtain the signatures and the hashes.

So, to close: try remote attestation today. The technology you need for it is in pretty much every device you have, like the notebook you're using. You can find Keylime at keylime.dev. And yeah, thank you. So, do we have questions? Lots of questions.

Thank you for a great presentation. One question: you talked a lot about the verification side of the process, but to have the golden values, or the PCRs, in your verifier, you need to provision them. I wasn't sure about the distribution side of things — how do you manage that in Keylime? Could you shed some light on that?

Yeah. With the golden values, we have the values in the TPM, and they are also tied to an event log, from IMA and from measured boot. We avoid the need for actual golden values by having a policy engine that verifies the logs themselves and checks that they match the PCR values — so we check the logs, not just the end value. And the distributions can help here, because they can already provide a lot of the signatures, and which files are in which packages and where they end up. That makes life easier on the distribution side.

Yes, sir. What is the performance of such a check — how much time does it take, and how much data is required for such monitoring?

From what I saw — I don't have benchmarks for this — it's pretty quick, something like 200 milliseconds for the round trip from request to response in my tests, but of course that's on my machine. We don't have formal performance benchmarks. Yeah, and it also heavily depends on what you want to attest: if you just have measured boot, it's the quote time on a hardware TPM, at most around a second, and then it's at most a few megabytes — single digits — of data that gets transferred.

You said that one of the challenges of implementation was dealing with false positives and maybe false negatives. Can you give some examples of when that would occur?

Yeah. Because we are still talking over the network, you get something like a false positive if the network connection goes down. The other one is kind of a false positive and kind of not: if your policy is not up to date. For the system it's not a false positive in the traditional sense, but in practice it is, because we don't want that alert to actually fire.

For the university use case, how do you know that you're actually talking to the real TPM in the laptop?
We have two ways. First, we verify against the hardware manufacturer: they have a CA that we can verify against. And we can also enroll the notebooks directly, so we check the devices beforehand — which reminds me to say that the university part is still a proof of concept. We are currently working on it, but it's not rolled out at large scale.

How do you make sure that an alert event — a new change that happened — is not intercepted over the network? Sorry, once again? How do you make sure that when there's an event saying there's a change on the machine, a new measurement that appears, that event is not intercepted on the network between the monitored machine and the trusted system? So is the question how we deal with losing the connection between the agent, the monitored system, and the verifier? Losing the connection, or having something in between that makes sure the trusted system is never notified that a machine is going to be, or just became, compromised.

Yes. If the connection between the agent and the verifier side is blocked, we get a timeout and the agent automatically becomes distrusted. And for the notification system itself — if we notify all the other agents — then of course there is an issue if you cannot reach them over a trusted channel; in that direction it's basically game over. So if you want to guarantee your revocation alerts at all times, they need to go through a trusted channel. The trust boundary covers the attestation part, which we can see, but for the revocation part, if you want it to reach the agents, it needs to go through a trusted channel.

Yeah, so continuing on this question, actually: how do you make sure that your verifier connects to the right agent, and that you don't have a man-in-the-middle attack rerouting this to a fake agent and a fake TPM?

Yeah, that's tied to the EK certificate. You trust the manufacturer, because when they manufacture the TPM, they create this key, which cannot be modified or removed in any way, so it provides the identity of the TPM. When you get information from the TPM, or from some agent, you can verify that the data came from the TPM that holds that EK, because it's signed, and you can verify the origin using the CA certificate provided by the manufacturer. So you can check that the TPM is exactly the one you expected, using the EK certificate.

Okay, thank you for the talk, and thank you for all the questions. We are out of time. Thank you.