Hey everyone, Daniel Almeida here from Collabora. Today we're going to talk a little bit about stateless decoder virtualization using virtio-video and Rust, and mainly about the status of virtio-video in general. There's been a long hiatus, and different companies now carry different downstream patches, but recently there's been a new push to get everything upstream, with new conversations taking place on the mailing lists. So I think this is a good time to do a recap of the virtio-video device, and also to showcase how we're using Rust in the ChromeOS implementation, as Collabora has been working closely with the ChromeOS engineers to make this happen. Without further ado, let's get started.

The first question everyone should ask is: why are we doing this, and why is it important? The answer is basically twofold. The first reason is that video data makes up a massive share of internet traffic. Cisco collected data predicting that by 2022, 82% of all consumer internet traffic would be video-related, up from 77% in 2018. So video is a huge share of traffic; that's one thing.

The other is a newer use case for Chromebooks: you can buy a Chromebook, a laptop or ultrabook of sorts, and run Android applications on it thanks to the ARCVM virtualization layer, which transparently virtualizes Android apps so you can use them on your Chromebook in a more or less seamless way. This means a user can run Netflix, YouTube, games, and other apps from the Play Store, which makes the device much more useful in general. These Chromebooks are usually capable of hardware-accelerated video decoding, and if the hardware can do that, it's a good idea to expose this capability to Android apps as well, so that they can benefit from the hardware in the machine.
With that said, before we explore this any further, we should talk a little bit about V4L2 memory-to-memory devices. I have a figure here, which I've taken from Hans Verkuil, the V4L2 maintainer (thanks, Hans). It shows a codec device sitting between two queues: on the left side is what we call the OUTPUT queue, and on the right side is what we call the CAPTURE queue. These queues contain buffers, and a userspace app will be continuously queuing and dequeuing buffers on both of them.

For a decoder, which is the kind of codec device I want to focus on in this presentation (we could talk about encoders, but let's stick to decoders), the userspace application queues bitstream buffers on the OUTPUT queue, that is, buffers containing data compressed with some codec such as VP9, H.264 or HEVC. This data is eventually processed by the device, which places the results into the buffers on the CAPTURE queue, and the userspace app can then dequeue those buffers containing the decoded data. This loop runs for as long as the device is decoding.

There's also a finite state machine that drives the device. One question you might ask is why this state machine exists at all, and the reason is that a model with just two queues and a codec device in the middle isn't sufficient to express certain scenarios. For instance, if you're playing a video and want to seek to another position, which happens all the time (you're watching one portion of a video and want to jump to a different one), the two-queue model alone can't express that. So the idea is that there are a number of states, and you transition between them by issuing ioctls against the video device node you've presumably opened beforehand, something like /dev/video0. You open that video node, issue ioctls against it to transition between states, and eventually you end up in the decoding state, where the decode loop takes place and you queue and dequeue buffers for the codec to process.
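To make that loop a bit more concrete, here is a minimal Rust sketch of the flow just described. It is purely illustrative: the queue_output_buffer and dequeue_capture_buffer helpers are hypothetical stand-ins for the real VIDIOC_QBUF and VIDIOC_DQBUF ioctls issued against /dev/video0, and the state machine is reduced to the bare minimum, with no handling of seeks, drains or dynamic resolution changes.

```rust
// Conceptual sketch of the V4L2 stateful decode loop; the helpers below are
// hypothetical stand-ins for the real ioctls, not an actual V4L2 binding.

#[derive(Debug)]
enum DecoderState {
    Stopped,  // device node opened, no streaming yet
    Decoding, // both queues streaming, decode loop running
}

struct BitstreamBuffer(Vec<u8>); // compressed data (e.g. H.264/VP9) for the OUTPUT queue
struct DecodedFrame(Vec<u8>);    // raw frame returned on the CAPTURE queue

// Hypothetical stand-ins for the real ioctl calls.
fn queue_output_buffer(_buf: BitstreamBuffer) { /* VIDIOC_QBUF on the OUTPUT queue */ }
fn dequeue_capture_buffer() -> Option<DecodedFrame> { /* VIDIOC_DQBUF on the CAPTURE queue */ None }

fn decode_loop(bitstream_chunks: Vec<BitstreamBuffer>) -> Vec<DecodedFrame> {
    // Issuing VIDIOC_STREAMON on both queues (not shown) is what moves the
    // device from the Stopped state into the Decoding state.
    let _state = DecoderState::Decoding;

    let mut frames = Vec::new();
    for chunk in bitstream_chunks {
        // Userspace queues compressed data on the OUTPUT queue...
        queue_output_buffer(chunk);
        // ...and dequeues whatever decoded frames the device has produced on
        // the CAPTURE queue (not necessarily one per input buffer).
        while let Some(frame) = dequeue_capture_buffer() {
            frames.push(frame);
        }
    }
    frames
}
```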
With that said, I want to talk a little bit more about the difference between a stateful and a stateless architecture. The main difference basically boils down to who keeps track of the decoding state. When you're decoding video, there's some state that somebody has to maintain, usually the set of decoded picture buffers, among other things. In a stateful architecture, the driver or the firmware is the piece that keeps track of that, whereas in a stateless architecture it's the userspace program that tracks the decoding state. In a stateless architecture the device is a clean slate that userspace programs with metadata extracted from the bitstream, and it just processes that one frame; in a stateful architecture you simply feed in data and the device acts as a black box, eventually handing you back decoded frames. So they are two different approaches to video decoding in general.

With that said, we can talk a little bit about virtio-video. virtio-video was initially developed by Google and OpenSynergy, and the latest submission upstream, which includes a kernel driver, refers to version 3 of the virtio-video protocol. Google carries downstream patches to use that driver within ChromeOS, with their own device implementation.

In virtio-video we basically have two virtqueues. Virtqueues are a virtio concept: queues over which driver and device communicate by exchanging memory. One is the command queue, where the driver pushes data, so commands go from the guest to the host device for processing, and the other is the event queue, where the opposite communication takes place, so that the host can inform the guest about things like dynamic resolution changes or errors.

The reason we were talking about stateful and stateless V4L2 memory-to-memory devices is that the virtio-video kernel driver exposes itself to the guest as a V4L2 stateful device.
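To recap that stateful/stateless distinction in interface terms, here's an illustrative Rust sketch of the two models. The traits and types are invented for this example and don't correspond to the kernel API or to cros-codecs; the only point is who supplies the decoding state.

```rust
// Illustrative sketch (not a real kernel or cros-codecs API) of how the two
// architectures differ in who keeps the decoding state. All names are made up.

struct CompressedChunk(Vec<u8>);
struct DecodedFrame(Vec<u8>);

/// Per-frame metadata that userspace parses out of the bitstream itself,
/// e.g. slice/picture parameters and the set of reference frames (the DPB).
struct FrameMetadata {
    reference_frames: Vec<DecodedFrame>,
    // ...plus whatever slice and picture parameters the codec requires...
}

/// Stateful model: the driver/firmware is a black box that tracks the DPB and
/// other decoding state internally; userspace only feeds it bitstream.
trait StatefulDecoder {
    fn decode(&mut self, bitstream: CompressedChunk) -> Vec<DecodedFrame>;
}

/// Stateless model: the device is a clean slate programmed per frame, so
/// userspace must pass in the metadata and reference frames it is tracking.
trait StatelessDecoder {
    fn decode_frame(
        &mut self,
        bitstream: CompressedChunk,
        metadata: &FrameMetadata,
    ) -> DecodedFrame;
}
```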
So why V4L2 stateful? Well, first of all, it's a mature interface, and it covers the cases where the underlying decoder IP is not inside a GPU. There are approaches out there that try to virtualize just VA-API or something along those lines, but we really wanted to also cater to the case where the decoder IP is not part of a GPU, because we have devices where that is precisely the case. A black-box approach is also really useful, because we just want to send the device data and have the decoding happen in the background, without the guest application being aware that there's an entire virtualization layer underneath. The driver is also heavily based on virtio-gpu, which is already upstream in the Linux kernel.

The idea behind the kernel driver is really simple: it translates V4L2 ioctls into virtio-video commands. The guest userspace app will, as we said previously, issue V4L2 ioctls against the video device node, so that it can change states and eventually end up in the decoding state, and, while in the decoding state, queue and dequeue buffers in that decoding loop. Whenever the guest issues ioctls against the video node, the kernel driver translates them into virtio-video commands and places those commands on the command queue for further processing by the host device. By doing this translation it ends up implementing the V4L2 stateful finite state machine, so a guest userspace app doesn't have to know that any virtualization is taking place: it just submits ioctls, submits data in the buffers, and eventually dequeues buffers with the decoded data in them.

Here's a small example to drive home what I'm trying to say. In this figure we have one ioctl being issued by the guest userspace, in this particular case VIDIOC_CREATE_BUFS (VIDIOC_REQBUFS is another call that serves a similar purpose), which is simply the way for a userspace app to tell V4L2 that it wants to allocate buffers. The virtio-video kernel driver, which again presents a V4L2 stateful device, intercepts that call, translates it into a virtio-video resource-create command, and places that command on the command queue for processing by the host; the host then talks to this question-mark box somehow to turn that command into something useful.

So here's the architecture thus far: in the guest we have a userspace app issuing ioctls against the virtio-video kernel driver, and the virtio-video kernel driver translating those ioctls into virtio-video-specific commands.
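Before continuing down the stack, here's a rough Rust sketch of that translation step. The command set and ioctl representation are heavily simplified and the names are hypothetical; the real driver deals with the full V4L2 API and the actual virtio-video protocol structures.

```rust
// Heavily simplified, hypothetical sketch of the ioctl-to-command translation
// done by the virtio-video kernel driver; names do not match the real code.

#[derive(Clone, Copy)]
enum QueueType {
    Output,  // bitstream buffers (guest -> device)
    Capture, // decoded frames (device -> guest)
}

/// A few of the V4L2 ioctls a guest decoder application might issue.
enum V4l2Ioctl {
    ReqBufs { queue: QueueType, count: u32 },
    StreamOn { queue: QueueType },
    QBuf { queue: QueueType, index: u32 },
}

/// Simplified stand-ins for the commands placed on the virtio-video command queue.
enum VirtioVideoCmd {
    ResourceCreate { queue: QueueType, count: u32 },
    StreamStart { queue: QueueType },
    ResourceQueue { queue: QueueType, index: u32 },
}

/// Translate a guest ioctl into the command the driver would queue for the host.
fn translate(ioctl: V4l2Ioctl) -> VirtioVideoCmd {
    match ioctl {
        // Allocating buffers maps to creating resources on the host side.
        V4l2Ioctl::ReqBufs { queue, count } => VirtioVideoCmd::ResourceCreate { queue, count },
        // Starting streaming tells the host to start processing that queue.
        V4l2Ioctl::StreamOn { queue } => VirtioVideoCmd::StreamStart { queue },
        // Queuing a buffer hands the corresponding resource to the host.
        V4l2Ioctl::QBuf { queue, index } => VirtioVideoCmd::ResourceQueue { queue, index },
    }
}
```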
For our use case, we have crosvm, Google's virtual machine manager, dequeuing these commands from the command queue and processing them using this question-mark shaded box on the host. Eventually that shaded box will somehow decode the video data and pipe the frames back to crosvm; crosvm pushes the frames back through the virtqueues to the virtio-video kernel driver, and the virtio-video kernel driver can then make the frames available to the guest userspace application, which can be GStreamer, FFmpeg, or other apps.

Now we have to talk a little bit about what that shaded question-mark box is, and the answer is: a crosvm decoder backend. What is crosvm in the first place? crosvm is the virtual machine manager that ships with ChromeOS, and it's the cornerstone of the ChromeOS virtualization layer. When you're running Android apps, for instance, it's crosvm that provides the virtualization for them. It has a huge focus on security, it's written in Rust, and it's focused on virtio devices. The main point here is that crosvm, as a virtual machine manager, really has no idea how to decode video; that's a very different thing from what it was built to do, so it has to interface with something to get the video data decoded. That something, which I've denoted with the shaded question-mark box, is what we call a backend in crosvm.

We have three different backends for crosvm nowadays. The first is libvda. The idea with libvda is pretty simple: it's a library that lets you interface with the Chromium GPU process to actually decode video data. Most of us know that Chromium is a very mature project with a very mature video decoding stack, so the idea is simple: just ask Chromium to decode the data, and there you go. But this has a major issue: we have a virtual machine manager written in Rust, with a focus on security and memory safety, linking against a web browser, which is a very different kind of software and which, by the way, is not written in Rust. This is a problem the ChromeOS engineers wanted to do away with, which is why we have cros-codecs, our own crate written in Rust.
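To give a feel for what "backend" means here, below is an illustrative Rust sketch of the kind of abstraction a VMM can use to stay agnostic of how frames actually get decoded. This is not crosvm's real trait or API; the names are invented, and the point is only the design idea that libvda, FFmpeg, or cros-codecs can sit behind one small decoder interface.

```rust
// Illustrative sketch only: crosvm's actual backend abstraction has a richer,
// different interface; the names below are made up for this example.

struct BitstreamChunk(Vec<u8>);
struct DecodedFrame(Vec<u8>);

/// The small surface the VMM needs from whoever does the actual decoding.
trait DecoderBackend {
    fn decode(&mut self, input: BitstreamChunk) -> Vec<DecodedFrame>;
}

/// Backend that forwards work to the Chromium GPU process through libvda.
struct LibVdaBackend;
impl DecoderBackend for LibVdaBackend {
    fn decode(&mut self, _input: BitstreamChunk) -> Vec<DecodedFrame> {
        // Would call into libvda here.
        Vec::new()
    }
}

/// Pure-Rust backend built on the cros-codecs crate.
struct CrosCodecsBackend;
impl DecoderBackend for CrosCodecsBackend {
    fn decode(&mut self, _input: BitstreamChunk) -> Vec<DecodedFrame> {
        // Would drive a cros-codecs stateless decoder here.
        Vec::new()
    }
}

/// The virtio-video device model only sees the trait, not the implementation.
fn process_command(backend: &mut dyn DecoderBackend, chunk: BitstreamChunk) -> Vec<DecodedFrame> {
    backend.decode(chunk)
}
```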
There's also an FFmpeg backend that uses FFmpeg's software decoders. It's used only for testing, so that you can exercise the virtio-video implementation in crosvm without owning a Chromebook: you can test it on a regular laptop with the FFmpeg backend, and also with cros-codecs. The idea with FFmpeg really is just testing; we're not integrating FFmpeg's hardware acceleration, because again, FFmpeg is a huge project written in C.

Which brings us to cros-codecs, our solution. cros-codecs is basically a crate, a library written in Rust, to do video decoding in safe Rust with all the guarantees the Rust language gives us, memory safety and so on. It's not published on crates.io yet because it's heavily a work in progress, and it contains all the pieces necessary to do video decoding: mainly the parsers, which extract the metadata used to drive the decoder; the decoder logic, which is the piece that keeps track of the state we talked about previously, things like the set of reference frames and any other information you have to carry between frames; and its own backends, as we'll see shortly. Currently there is a VA-API backend, so cros-codecs will itself use the VA-API driver on the system to get the video decoded, and we're also working on another backend, a V4L2 stateless backend.

So here's a more complete picture; everything in it is just that shaded question-mark box from earlier. crosvm will, for now, use cros-codecs to decode video; cros-codecs will use its VA-API backend, which talks to the VA-API implementation on the system, which talks to the VA-API driver, so the Intel media driver or Mesa depending on which graphics card you're using, and that in turn talks to DRM in the host kernel. Up to this point nobody really knows how to decode video, but once DRM starts talking to the GPU, the GPU does know how, because it has an IP block in there, circuitry specialized in video decoding. The GPU does the decoding and produces the raw decoded data, which is then pushed all the way back up until it reaches crosvm; crosvm pushes it back to the virtio-video kernel driver in the guest, and the virtio-video kernel driver makes the decoded data available to the guest userspace application.
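To summarize how those pieces fit together, here's an illustrative Rust sketch of the split between parser, decoder state and hardware backend described above. The trait and type names are invented for this example and don't mirror cros-codecs' real API.

```rust
// Invented names; a conceptual sketch of the parser / decoder / backend split,
// not the actual cros-codecs API.

struct Slice(Vec<u8>);   // one unit of compressed data
struct SliceParams;      // metadata parsed out of the bitstream
struct Surface(Vec<u8>); // a decoded picture owned by the backend

/// Parsers turn raw bitstream into the metadata needed to program the hardware.
trait Parser {
    fn parse(&mut self, slice: &Slice) -> SliceParams;
}

/// Backends program an accelerator (e.g. through VA-API, or a V4L2 stateless
/// driver) with the metadata for exactly one frame.
trait Backend {
    fn decode_frame(&mut self, params: &SliceParams, refs: &[Surface]) -> Surface;
}

/// The decoder logic is the stateful part: it owns the reference frames and
/// everything else that must be tracked between frames.
struct Decoder<P: Parser, B: Backend> {
    parser: P,
    backend: B,
    reference_frames: Vec<Surface>,
}

impl<P: Parser, B: Backend> Decoder<P, B> {
    fn decode(&mut self, slice: &Slice) -> &Surface {
        let params = self.parser.parse(slice);
        let frame = self.backend.decode_frame(&params, &self.reference_frames);
        // Decide whether the new frame becomes a reference for later frames
        // (real decoders apply the codec's DPB management rules here).
        self.reference_frames.push(frame);
        self.reference_frames.last().unwrap()
    }
}
```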
So here's some of the backlog. We still have to upstream the virtio-video protocol; as I said, there's been a new push to get everything upstream, and Google is collaborating with OpenSynergy again so that we can get virtio-video upstream, because it's not upstream yet. We plan on adding more codec support, because so far only VP8, VP9 and H.264 are supported, and most people want to see HEVC and AV1, which is the state of the art for video codecs. There's also encoder support for cros-codecs in particular: while virtio-video itself supports encoding, Google's implementation involving cros-codecs does not yet. You can encode using libvda, which again is the path that goes through the Chromium GPU process and is already used in production, but there's no proper encoder support in cros-codecs yet. And we're working on the V4L2 stateless backend in cros-codecs so that we can support more devices.

A quick summary: Google is already using virtio-video in production through libvda. We've been working with the ChromeOS engineers so that the libvda dependency can be removed, because Google really wants the video decoding to be done in safe Rust. We plan to upstream the virtio-video protocol in collaboration with Google and other industry players. For Google in particular this improves the experience for Chromebook users, but that's only one application of virtio-video in general; other companies can benefit from the virtio-video work that's been done here and use it for their own projects and use cases.

That was basically what I had to say about virtio-video. I hope it was informative, and thank you very much!