So, our next speaker is Joanna White, who will talk about STORA, the System for Television Off-air Recording and Archiving, and the BFI National Television Archive. Welcome her.

Thank you. Thank you. It's wonderful to be here today. Thank you for coming, and thank you to FOSDEM for letting us speak here. I am Joanna White, a developer at the BFI National Archive in the Data and Digital Preservation department. Today I'll be talking briefly about STORA, the System for Television Off-air Recording and Archiving. It's a project that we've built in-house.

So the BFI, or British Film Institute, promotes and preserves film and television across the UK, and the BFI National Archive is a department within the BFI and also one of the largest archives in the world. We have nearly one million digitised moving image assets in our digital preservation infrastructure, or DPI as we call it. That means they've been ingested into our Spectra Logic tape libraries for long-term preservation, and they've also been catalogued in our collections information database, which we call CID. By far the largest collection of moving image materials in our archive is our off-air television recordings, with nearly 650,000 programme files in DPI. You can see a selection of them displayed here; this is our staff DPI browser, which is internal. There's also a further 800,000 preserved off-air recordings waiting to be processed, ingested and catalogued in a future project.

So the BFI is the body designated by Ofcom as the National Television Archive. This designation, under provisions in the Broadcasting Act 1990, allows us to record, preserve and make accessible off-air TV under section 75 of the Copyright, Designs and Patents Act 1988 and, later, the Copyright and Rights in Performances Regulations 2014. Okay, that's the official bit.

The BFI National Archive began recording off-air TV to one-inch reel videotapes, as you can see here, in 1985, with the permission of select UK broadcasters. Programmes were curatorially chosen and captured by teams who would work around the clock in shifts. In 2015, off-air TV recording became an automated process for us when we started collecting live TV programmes 24/7. To do this, the BBC agreed to provide us with a fork of their Redux off-air capture project, which you can see here, and we worked with BBC developers to integrate it into our digital preservation infrastructure. The goal was to store MPEG-TS files to our Spectra Logic tape libraries for long-term storage. Redux is built on open source technology: it runs from Linux-installed servers and uses open source tools to record both television and radio programming for the BBC. At the BFI, we just use it for off-air television.

So in May 2022, BBC Redux was shut down. In anticipation, the head of my department, the head of Data and Digital Preservation, Stephen McConnachie, had launched our own R&D project the year before. Along with two BFI engineers, John Daniel and Brian Fattarini, we built a software recording solution to emulate many features of Redux, with the aim of not disrupting our existing DPI ingest workflows during that changeover period. So, like Redux, STORA records satellite-streamed UK broadcasts. The channels are a mix of high-definition and standard-definition streams, many broadcasting 24 hours a day. One full day of off-air recording captures around 500 programmes to our storage. That's roughly 850 gigabytes of data, and roughly 300 terabytes every year.
So we receive our signals from the Astra satellites, which broadcast mostly direct-to-home TV channels in Europe. It is nice to still be considered in Europe in this regard. They're received by our satellite dishes and passed through Quattro low-noise blocks before reaching TBS TV PCIe receiver cards. The signals are routed through patch fields to a multiswitch, which selects band and polarisation. We use three multiswitches for STORA so we can have 24 potential multiplexes. We've got a Cesbo application which demuxes each channel's MPEG transport stream into a single-programme transport stream, creating a unicast Real-time Transport Protocol, or RTP, stream and a unicast User Datagram Protocol, or UDP, stream. We need both for our recording method. If you'd like to know more about the hardware setup, I can put you in touch with my colleague; it's not my area, I'm afraid.

For those of you who are familiar with BBC Redux, you may recognise the folder naming convention and the contents of the folders. As I said, we have automated ingest workflows that needed this structure to be maintained. The folder path comprises the recording date, the channel name and the individual programme's broadcast time data in the name of the folder. We've also got the unique event ID for the programme that's being shown, in this case 476. Within the folder you'll find three files. First, the info CSV: this file contains programme information, including channel, title, description and so on. Next, we have the stream MPEG-TS file. This is the recording captured from the broadcast. The stream is not re-encoded but dumped directly to storage, so it contains the packetised elementary streams, which wrap the main data streams, usually H.264 video, AC-3 or MPEG audio, plus subtitles and information tables. You can view all this data really nicely when you look at it in VLC. Finally, we have the subtitles file, which contains an extracted transcript of all the spoken word from the programme. It's formatted as Web Video Text Tracks, or WebVTT. Making sure that we don't lose any of this information is really critical to our preservation goals.

STORA's code has been made possible by a wonderful collection of open source tools, which you can see here. We have Linux Ubuntu operating systems, and we use Linux command line tools throughout the code. STORA is written in Python with a few external libraries such as Tenacity and python-vlc. python-vlc allows us to work easily in the code with the amazing software VLC from VideoLAN; you'll probably have seen them around FOSDEM in the hats. VLC relies on the outstanding FFmpeg libraries to operate, and FFmpeg is kind of worshipped at the BFI and in many archives globally. libdvbtee parses the service information in the UDP streams, and it's key to how the scripts record the programmes. MediaInfo provides detailed technical metadata for analysis of the MPEG-TS files. CCExtractor extracts the subtitles from the MPEG-TS file, saving them to a separate, formatted file. And Nagios Core provides a monitoring service for real-time alerts when streams fail or recordings stop.

So I'll quickly talk you through how STORA uses these pieces of software. We'll look first at the recording scripts, which make the file containing the MPEG transport stream. There used to be two recording methods in the STORA code base, but they've recently been merged into one script; I'll unpack that shortly.
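Just to make that folder-and-files layout concrete, here's a tiny Python sketch of the kind of structure being described. The exact path format and file names aren't spelled out in full in the talk, so everything below is illustrative rather than the real Redux/STORA convention.

```python
from pathlib import Path


def programme_folder(base: Path, date: str, channel: str,
                     start: str, duration: str, event_id: int) -> Path:
    """Build an illustrative Redux-style programme folder path.

    The real convention encodes recording date, channel name, broadcast
    time data and the unique event ID; the ordering and separators used
    here are assumptions, not the actual STORA layout.
    """
    folder = base / date / channel / f"{start}-{duration}-{event_id}"
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Each programme folder ends up holding three files (names illustrative):
#   info.csv       - programme metadata (channel, title, description, ...)
#   stream.mpeg.ts - the MPEG transport stream dumped straight from broadcast
#   subtitles.vtt  - WebVTT transcript extracted from the stream
folder = programme_folder(Path("/mnt/stora"), "2022-05-12", "bbcone",
                          "12-30-00", "00-45-00", 476)
print(folder)
```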
Both methods capture the MPEG transport stream using VLC, but they differ in how they start and stop the recordings.

The first script I wrote utilises electronic programme guide (EPG) metadata, which you can see at the top. We get this from a commercial supplier, retrieved daily from their REST API. The EPG data is converted in Python into a JSON schedule for a single day's programmes; one is created every day for every channel. Recordings are then prompted to start and stop from this JSON schedule. The script loops through every scheduled item before it exits at the end of the last programme, which is usually just after midnight. We then have shell restart scripts that run from crontab, which immediately restart the script so it picks up the next day's schedule and carries on. A quick shout-out here: I'm quite a new developer, and when I had this project placed on my plate it was a little bit overwhelming, but I came across a script on ActiveState Code written in 2015, weirdly also by somebody named J-White, jwhite88. If anyone knows them, please thank them for me. Nobody knows them, so I'm going to assume time travel is a thing by the time I'm 88, and I come back in time and give this to myself, which is a nice idea.

So, on to the second and better method for recording the off-air streams. It monitors the User Datagram Protocol (UDP) stream, gets the service information data and watches for changes in the event ID field for that broadcast stream; you can see that at the top. The event ID is the unique identifier for a programme. The script stores the last five event IDs that have been broadcast, and if a new one turns up, it knows that a new recording needs to be triggered. So it can potentially loop indefinitely, monitoring a UDP stream in this way and placing TV programmes into their own unique folder paths, which you've seen. These event ID changes almost always fall right at the beginning of a new programme as it starts, so it's a really neat way to start and stop the recordings in the schedule.

Another shout-out is needed here for the open source project libdvbtee. I think it's a fork of a VLC library, I'm not sure, but it's by Michael Krufke. It's a stream parser and service information aggregator library for MPEG-2 transport streams. The recording script calls dvbtee from a Python subprocess-spawned shell and captures the libdvbtee JSON-formatted response. The command has a timeout flag, which usually ensures the information is returned within two to five seconds. This response is reformatted into a Python dictionary, and this provides the trigger for the VLC record start/stop process.

Just to visualise how this method works: it does require us to have two streams, which is a little bit awkward, but it doesn't really cause us any problems. Here you can see that the script monitors the UDP stream, waiting for the event ID number in that stream to change. When an event ID change is sensed, the current VLC recording on the RTP stream is stopped, and a new folder is created with the start time and duration of the next programme. Into this folder goes the RTP stream, captured by VLC. And this is the code used to start and stop the VLC recording: the Python binding needs to create a VLC instance from the Instance class in python-vlc and initiate a new media player object.
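Before moving on to that VLC code, here's a minimal Python sketch of the event-ID watching idea; it isn't the actual STORA script. The dvbtee flags and the JSON key name are assumptions based on the description above, so check the libdvbtee documentation before reusing any of this.

```python
import json
import subprocess
from collections import deque

UDP_URL = "udp://@239.0.0.1:5000"   # illustrative UDP stream address
RECENT_IDS = deque(maxlen=5)        # the script keeps the last five event IDs


def current_event_id(udp_url):
    """Call dvbtee on the UDP stream and pull an event ID out of its JSON.

    Flags are assumptions from the talk: '-i' input URI, '-t' timeout in
    seconds, '-j' JSON output of the decoded service information. Real
    dvbtee output may mix log lines with JSON, so the real script will
    need more careful parsing than this.
    """
    result = subprocess.run(
        ["dvbtee", "-i", udp_url, "-t", "5", "-j"],
        capture_output=True, text=True, timeout=30,
    )
    try:
        tables = json.loads(result.stdout)
    except json.JSONDecodeError:
        return None
    return find_event_id(tables)


def find_event_id(node):
    """Recursively look for an 'event_id'-style key in the parsed JSON.

    The actual key name in dvbtee's output may differ; this generic
    search is just to keep the sketch self-contained.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower().replace("-", "_") == "event_id":
                return int(value)
            found = find_event_id(value)
            if found is not None:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_event_id(item)
            if found is not None:
                return found
    return None


def new_programme_started(udp_url):
    """True when an event ID we haven't seen recently appears on the stream."""
    event_id = current_event_id(udp_url)
    if event_id is None or event_id in RECENT_IDS:
        return False
    RECENT_IDS.append(event_id)
    return True
```

In the real script a check like this sits inside the indefinite monitoring loop: when it fires, the VLC recording described next is stopped and restarted into a freshly created programme folder.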
Both the instance and the media player are called into the main script to start and stop the recordings. We use the demux dump command, which uses VLC's dump module from its demux library, a tool developed essentially for debugging, but it dumps the content directly to file without decoding it. I also have the append flag in there so that if a recording breaks midway through a programme and then starts again, it will append to the existing file rather than overwrite it. If that happens, a restart warning text file is placed into the channel folder with the date and timestamps, so we know there's potentially a break in the stream. This is pretty rare, though; it doesn't happen very often.

We also rely on the MediaInfo software in the get stream info script. It uses a Python subprocess again to spawn a MediaInfo call, capturing the programme start and duration metadata, which is all then dumped into a CSV file. Then, to extract the WebVTT files, we use the software CCExtractor. We launch it from the Python script, again via subprocess; subprocess is so important to these processes. This is a simple command that flags the WebVTT output format and then creates the file you can see here. We then import this data into our CID database, where it's viewable and searchable and provides rich text metadata for the curatorial teams.

Lastly, we have Nagios, an event monitoring system which issues alerts when problems are detected. We have separate channel alerts for recording failures, which are identified by comparing a checksum of the current stream MPEG-TS file against one taken four seconds earlier. We also have a stream check, which looks in the Cesbo software for an 'on air equals true' status for every channel. If either of those fails, we get a display that says critical, but we also get an email sent to us with the context for what the failure is.

Okay, so that's a rough guide to STORA, and in particular how the code interacts with these open source projects. The open source repository contains all the STORA scripts, descriptions of the code base, dependencies, environment variables and crontab launch details. It has an MIT licence. I hope it may be of some interest here, but as a relatively new developer I very much welcome feedback and advice. None of the team in the Data and Digital Preservation department have computer science backgrounds; they're all archivists or TV people. I used to be a cameraman and an independent documentary maker, so to be able to stand here and talk about a project like STORA, with just a few years' coding experience, is really mind-blowing for me, particularly at a time when accurately recording our televised social history is just so critical. This has really been made possible thanks to the open source tools we use and the developers we see in the room here. Thank you from the archiving world.

There's also a growing interest in audiovisual archives globally in working more with open source software and standards. Many of us meet annually at a conference called No Time To Wait, which happens here in Europe, and we definitely welcome new attendees who are developers. This conference has been connected with the development of the FFV1 codec, which was originally an FFmpeg project picked up and expanded by archivists working as developers. This codec is critical to the BFI's long-term preservation of thousands of video and film assets.
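And just to pull the VLC, MediaInfo and CCExtractor steps from this section together, here's a hedged sketch of the record-and-extract side in Python. The stream address and file names are illustrative, the per-media option strings follow the standard libvlc form rather than the exact STORA invocations, and the MediaInfo call uses JSON output as one example (STORA itself writes a CSV).

```python
import subprocess
import vlc

RTP_URL = "rtp://@239.0.0.2:5000"   # illustrative RTP stream address


def start_recording(ts_path):
    """Dump the RTP stream straight to file with VLC's 'dump' demuxer.

    demux=dump writes the transport stream without decoding it, and
    demuxdump-append re-opens the same file if a recording restarts
    mid-programme instead of overwriting it.
    """
    instance = vlc.Instance()
    player = instance.media_player_new()
    media = instance.media_new(
        RTP_URL,
        "demux=dump",
        f"demuxdump-file={ts_path}",
        "demuxdump-append",
    )
    player.set_media(media)
    player.play()
    return player


def stop_recording(player):
    """Stop the current capture when a new event ID is sensed."""
    player.stop()


def technical_metadata(ts_path):
    """Ask MediaInfo for technical metadata about the captured TS file."""
    return subprocess.run(
        ["mediainfo", "--Output=JSON", ts_path],
        capture_output=True, text=True, check=True,
    ).stdout


def extract_subtitles(ts_path, vtt_path):
    """Pull the subtitle track out of the TS as WebVTT with CCExtractor."""
    subprocess.run(
        ["ccextractor", ts_path, "-out=webvtt", "-o", vtt_path],
        check=True,
    )
```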
So the maintenance and upkeep of projects like FFmpeg is really very important to us. Traditionally, archives have relied on expensive proprietary hardware, software and codecs that are not scalable; they keep their information behind paywalls and they're not likely to offer the kind of technical support we need far enough into the future for long-term preservation. So having open workflows and standards developed within our own community is incredibly empowering for us. And yeah, this is the community where that's happening most at the moment, I would say, in the UK and in Europe. That's it. Thank you.

Thank you. The next talk will be in five minutes.