So, our next speaker is Joanna White, who will talk about STORA, the System for Television Off-air Recording and Archiving, and the BFI National Television Archive. Welcome her.

Thank you. Thank you. It's wonderful to be here today. Thank you for coming, and thank you to FOSDEM for letting us speak here. I am Joanna White, a developer at the BFI National Archive in the Data and Digital Preservation department. Today I'll be talking briefly about STORA, the System for Television Off-air Recording and Archiving. It's a project that we've built in-house.

So the BFI, or British Film Institute, promotes and preserves film and television across the UK, and the BFI National Archive is a department within the BFI and also one of the largest archives in the world. We have nearly one million digitised moving image assets in our digital preservation infrastructure, or DPI as we call it. That means they've been ingested into our Spectra Logic tape libraries for long-term preservation, and they've also been catalogued in our collections information database, which we call CID. By far the largest collection of moving image materials in our archive is our off-air television recordings, with nearly 650,000 programme files in DPI. You can see a selection of them displayed here; this is our staff DPI browser, which is internal. There's also a further 800,000 preserved off-air recordings waiting to be processed, ingested and catalogued in a future project.

So the BFI is the body designated by Ofcom as the National Television Archive. This designation, under provisions in the Broadcasting Act 1990, allows us to record, preserve and make accessible off-air TV under section 75 of the Copyright, Designs and Patents Act 1988 and, later, the Copyright and Rights in Performances Regulations 2014. Okay, that's the official bit.

The BFI National Archive began recording off-air TV to one-inch reel videotapes, as you can see here, in 1985, with the permission of select UK broadcasters. Programmes were curatorially chosen and captured by teams who would work around the clock in shifts. In 2015, off-air TV recording became an automated process for us when we started collecting live TV programmes 24/7. To do this, the BBC agreed to provide us with a fork of their Redux off-air capture project, which you can see here, and we worked with BBC developers to integrate it into our digital preservation infrastructure. The goal was to store MPEG-TS files to our Spectra Logic tape libraries for long-term storage. Redux is built on open source technology: it runs from Linux-installed servers and uses open source tools to record both television and radio programming for the BBC. At the BFI, we just use it for off-air television.

So in May 2022, BBC Redux was shut down. In anticipation, the head of my department, the head of Data and Digital Preservation, Stephen McConnachie, had launched our own R&D project the year before. Along with two BFI engineers, John Daniel and Brian Fattarini, we built a software recording solution to emulate many features of Redux, with the aim of not disrupting our existing DPI ingest workflows during that changeover period. So, like Redux, STORA records satellite-streamed UK broadcasts. The channels are a mix of high-definition and standard-definition streams, many broadcasting 24 hours a day. One full day of off-air recording captures around 500 programmes to our storage. That's roughly 850 gigabytes of data, and roughly 300 terabytes every year.
So we receive our signals from the Astra satellites, which broadcast mostly direct-to-home TV channels in Europe. It is nice to still be considered in Europe in this regard. They're received by our satellite dishes and passed through Quattro low-noise blocks before reaching TBS TV PCIe receiver cards. The signals are routed through patch fields to a multiswitch, which selects band and polarisation. We use three multiswitches for STORA so we can have 24 potential multiplexes. We've got a Cesbo application which demuxes each channel's MPEG transport stream into a single-programme transport stream, creating a unicast Real-time Transport Protocol, or RTP, stream and a unicast User Datagram Protocol, or UDP, stream. We need both for our recording method. If you'd like to know more about the hardware setup, I can put you in touch with my colleague; it's not my area, I'm afraid.

For those of you who are familiar with BBC Redux, you may recognise the folder naming convention and the contents of the folders. As I said, we have automated ingest workflows that needed this structure to be maintained. The folder path comprises the recording date, the channel name and the individual programme's broadcast time data in the name of the folder. We've also got the unique event ID for the programme that's being shown, in this case 476. Within the folder you'll find three files. First, the info CSV: this file contains programme information, including channel, title, description and so on. Next, we have the stream MPEG-TS file. This is the recording captured from the broadcast. The stream is not re-encoded but dumped directly to storage, so it contains the packetised elementary streams, which wrap the main data streams, usually H.264 video, AC-3 or MPEG audio, plus subtitles and information tables. You can view all this data really nicely when you look at it in VLC. Finally, we have the subtitles file, which contains an extracted transcript of all the spoken word from the programme. It's formatted as Web Video Text Tracks, or WebVTT. Making sure that we don't lose any of this information is really critical to our preservation goals.

STORA's code has been made possible by a wonderful collection of open source tools, which you can see here. We have Linux Ubuntu operating systems, and we use Linux command line tools throughout the code. STORA is written in Python with a few external libraries such as Tenacity and python-vlc. python-vlc allows us to work easily in the code with the amazing software VLC from VideoLAN; you'll probably have seen them around FOSDEM in the hats. VLC relies on the outstanding FFmpeg libraries to operate, and FFmpeg is kind of worshipped at the BFI and in many archives globally. libdvbtee parses the service information in the UDP streams, and it's key to how the scripts record the programmes. MediaInfo provides detailed technical metadata for analysis of the MPEG-TS files. CCExtractor extracts the subtitles from the MPEG-TS file, saving them to a separate, formatted file. And Nagios Core provides a monitoring service for real-time alerts when streams fail or recordings stop.

So I'll quickly talk you through how STORA uses these pieces of software. We'll look first at the recording scripts, which make the file containing the MPEG transport stream. There used to be two recording methods in the STORA code base, but they've recently been merged into one script; I'll unpack that shortly.
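Just to make that folder-and-files layout concrete, here's a tiny Python sketch of the kind of structure being described. The exact path format and file names aren't spelled out in full in the talk, so everything below is illustrative rather than the real Redux/STORA convention.

```python
from pathlib import Path


def programme_folder(base: Path, date: str, channel: str,
                     start: str, duration: str, event_id: int) -> Path:
    """Build an illustrative Redux-style programme folder path.

    The real convention encodes recording date, channel name, broadcast
    time data and the unique event ID; the ordering and separators used
    here are assumptions, not the actual STORA layout.
    """
    folder = base / date / channel / f"{start}-{duration}-{event_id}"
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Each programme folder ends up holding three files (names illustrative):
#   info.csv       - programme metadata (channel, title, description, ...)
#   stream.mpeg.ts - the MPEG transport stream dumped straight from broadcast
#   subtitles.vtt  - WebVTT transcript extracted from the stream
folder = programme_folder(Path("/mnt/stora"), "2022-05-12", "bbcone",
                          "12-30-00", "00-45-00", 476)
print(folder)
```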
Both methods capture the MPEG transport stream using VLC, but they differ in how they start and stop the recordings.

The first script I wrote utilises electronic programme guide (EPG) metadata, which you can see at the top. We get this from a commercial supplier, retrieved daily from their REST API. The EPG data is converted in Python into a JSON schedule for a single day's programmes; one is created every day for every channel. Recordings are then prompted to start and stop from this JSON schedule. The script loops through every scheduled item before it exits at the end of the last programme, which is usually just after midnight. We then have shell restart scripts that run from crontab, which immediately restart the script so it picks up the next day's schedule and carries on. A quick shout-out here: I'm quite a new developer, and when I had this project placed on my plate it was a little bit overwhelming, but I came across a script on ActiveState Code written in 2015, weirdly also by somebody named J-White, jwhite88. If anyone knows them, please thank them for me. Nobody knows them, so I'm going to assume time travel is a thing by the time I'm 88, and I come back in time and give this to myself, which is a nice idea.

So, on to the second and better method for recording the off-air streams. It monitors the User Datagram Protocol (UDP) stream, gets the service information data and watches for changes in the event ID field for that broadcast stream; you can see that at the top. The event ID is the unique identifier for a programme. The script stores the last five event IDs that have been broadcast, and if a new one turns up, it knows that a new recording needs to be triggered. So it can potentially loop indefinitely, monitoring a UDP stream in this way and placing TV programmes into their own unique folder paths, which you've seen. These event ID changes almost always fall right at the beginning of a new programme as it starts, so it's a really neat way to start and stop the recordings in the schedule.

Another shout-out is needed here for the open source project libdvbtee. I think it's a fork of a VLC library, I'm not sure, but it's by Michael Krufke. It's a stream parser and service information aggregator library for MPEG-2 transport streams. The recording script calls dvbtee from a Python subprocess-spawned shell and captures the libdvbtee JSON-formatted response. The command has a timeout flag, which usually ensures the information is returned within two to five seconds. This response is reformatted into a Python dictionary, and this provides the trigger for the VLC record start/stop process.

Just to visualise how this method works: it does require us to have two streams, which is a little bit awkward, but it doesn't really cause us any problems. Here you can see that the script monitors the UDP stream, waiting for the event ID number in that stream to change. When an event ID change is sensed, the current VLC recording on the RTP stream is stopped, and a new folder is created with the start time and duration of the next programme. Into this folder goes the RTP stream, captured by VLC. And this is the code used to start and stop the VLC recording: the Python binding needs to create a VLC instance from the Instance class in python-vlc and initiate a new media player object.
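Before moving on to that VLC code, here's a minimal Python sketch of the event-ID watching idea; it isn't the actual STORA script. The dvbtee flags and the JSON key name are assumptions based on the description above, so check the libdvbtee documentation before reusing any of this.

```python
import json
import subprocess
from collections import deque

UDP_URL = "udp://@239.0.0.1:5000"   # illustrative UDP stream address
RECENT_IDS = deque(maxlen=5)        # the script keeps the last five event IDs


def current_event_id(udp_url):
    """Call dvbtee on the UDP stream and pull an event ID out of its JSON.

    Flags are assumptions from the talk: '-i' input URI, '-t' timeout in
    seconds, '-j' JSON output of the decoded service information. Real
    dvbtee output may mix log lines with JSON, so the real script will
    need more careful parsing than this.
    """
    result = subprocess.run(
        ["dvbtee", "-i", udp_url, "-t", "5", "-j"],
        capture_output=True, text=True, timeout=30,
    )
    try:
        tables = json.loads(result.stdout)
    except json.JSONDecodeError:
        return None
    return find_event_id(tables)


def find_event_id(node):
    """Recursively look for an 'event_id'-style key in the parsed JSON.

    The actual key name in dvbtee's output may differ; this generic
    search is just to keep the sketch self-contained.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower().replace("-", "_") == "event_id":
                return int(value)
            found = find_event_id(value)
            if found is not None:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_event_id(item)
            if found is not None:
                return found
    return None


def new_programme_started(udp_url):
    """True when an event ID we haven't seen recently appears on the stream."""
    event_id = current_event_id(udp_url)
    if event_id is None or event_id in RECENT_IDS:
        return False
    RECENT_IDS.append(event_id)
    return True
```

In the real script a check like this sits inside the indefinite monitoring loop: when it fires, the VLC recording described next is stopped and restarted into a freshly created programme folder.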
Both the instance and the media player are called into the main script to start and stop the recordings. We use the demux dump command, which uses VLC's dump module from its demux library, a tool developed essentially for debugging, but it dumps the content directly to file without decoding it. I also have the append flag in there so that if a recording breaks midway through a programme and then starts again, it will append to the existing file rather than overwrite it. If that happens, a restart warning text file is placed into the channel folder with the date and timestamps, so we know there's potentially a break in the stream. This is pretty rare, though; it doesn't happen very often.

We also rely on the MediaInfo software in the get stream info script. It uses a Python subprocess again to spawn a MediaInfo call, capturing the programme start and duration metadata, which is all then dumped into a CSV file. Then, to extract the WebVTT files, we use the software CCExtractor. We launch it from the Python script, again via subprocess; subprocess is so important to these processes. This is a simple command that flags the WebVTT output format and then creates the file you can see here. We then import this data into our CID database, where it's viewable and searchable and provides rich text metadata for the curatorial teams.

Lastly, we have Nagios, an event monitoring system which issues alerts when problems are detected. We have separate channel alerts for recording failures, which are identified by comparing a checksum of the current stream MPEG-TS file against one taken four seconds earlier. We also have a stream check, which looks in the Cesbo software for an 'on air equals true' status for every channel. If either of those fails, we get a display that says critical, but we also get an email sent to us with the context for what the failure is.

Okay, so that's a rough guide to STORA, and in particular how the code interacts with these open source projects. The open source repository contains all the STORA scripts, descriptions of the code base, dependencies, environment variables and crontab launch details. It has an MIT licence. I hope it may be of some interest here, but as a relatively new developer I very much welcome feedback and advice. None of the team in the Data and Digital Preservation department have computer science backgrounds; they're all archivists or TV people. I used to be a cameraman and an independent documentary maker, so to be able to stand here and talk about a project like STORA, with just a few years' coding experience, is really mind-blowing for me, particularly at a time when accurately recording our televised social history is just so critical. This has really been made possible thanks to the open source tools we use and the developers we see in the room here. Thank you from the archiving world.

There's also a growing interest in audiovisual archives globally in working more with open source software and standards. Many of us meet annually at a conference called No Time To Wait, which happens here in Europe, and we definitely welcome new attendees who are developers. This conference has been connected with the development of the FFV1 codec, which was originally an FFmpeg project picked up and expanded by archivists working as developers. This codec is critical to the BFI's long-term preservation of thousands of video and film assets.
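And just to pull the VLC, MediaInfo and CCExtractor steps from this section together, here's a hedged sketch of the record-and-extract side in Python. The stream address and file names are illustrative, the per-media option strings follow the standard libvlc form rather than the exact STORA invocations, and the MediaInfo call uses JSON output as one example (STORA itself writes a CSV).

```python
import subprocess
import vlc

RTP_URL = "rtp://@239.0.0.2:5000"   # illustrative RTP stream address


def start_recording(ts_path):
    """Dump the RTP stream straight to file with VLC's 'dump' demuxer.

    demux=dump writes the transport stream without decoding it, and
    demuxdump-append re-opens the same file if a recording restarts
    mid-programme instead of overwriting it.
    """
    instance = vlc.Instance()
    player = instance.media_player_new()
    media = instance.media_new(
        RTP_URL,
        "demux=dump",
        f"demuxdump-file={ts_path}",
        "demuxdump-append",
    )
    player.set_media(media)
    player.play()
    return player


def stop_recording(player):
    """Stop the current capture when a new event ID is sensed."""
    player.stop()


def technical_metadata(ts_path):
    """Ask MediaInfo for technical metadata about the captured TS file."""
    return subprocess.run(
        ["mediainfo", "--Output=JSON", ts_path],
        capture_output=True, text=True, check=True,
    ).stdout


def extract_subtitles(ts_path, vtt_path):
    """Pull the subtitle track out of the TS as WebVTT with CCExtractor."""
    subprocess.run(
        ["ccextractor", ts_path, "-out=webvtt", "-o", vtt_path],
        check=True,
    )
```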
So the maintenance and upkeep of projects like FFmpeg is really very important to us. Traditionally, archives have relied on expensive proprietary hardware, software and codecs that are not scalable; they keep their information behind paywalls and they're not likely to offer the kind of technical support we need far enough into the future for long-term preservation. So having open workflows and standards developed within our own community is incredibly empowering for us. And yeah, this is the community where that's happening most at the moment, I would say, in the UK and in Europe. That's it. Thank you.

Thank you. The next talk will be in five minutes.