Okay. Okay. So, yeah, then we'll continue basically right where Hansjord left off.
I'm Volker from KDE and I'll talk about how we do the semantic extraction in K-mail and
specifically focusing on the travel use case. Many of you probably traveled here, so you might see
why this could be useful. So, if you book your flight or your train or your hotel, you get the
confirmation as an HTML monstrosity full of advertisements and fine print and somewhere in
between is the information that you actually care about. So, you need to find that and transfer it
into like your calendar or your travel app and that if you do it manually, right, there's tedious
and error prone. So, why can't we have that automatically? And that's basically the point
that got me into that topic. I was on the way home from a conference needed to find my departure
gate and I was written in like light gray on white in that style. So, I did what you would do in
that case like you read the email source code because that's easier to read and I stumbled
about a nice compact summary of the trip and that was the schema.org Jason that Hansjord
mentioned. So, just showing that in our email client right that should be easy. Six and a half years
later, I'm not standing here and still talking about that subject, so as things usually go. So,
Hansjord showed us already, right, it's the schema.org Jason that is something that I think
Google proposed 10 or 15 years ago for websites and for HTML email. Meanwhile, managed by the W3C,
so it's a proper open standard. As an ontology that tries to model the complexities of the real
world, right, it has all the fun involved with that. But generally that is sane and something we
can work with. Then, however, we got in touch with the harsh realities out there because there's not
just that nice Jason format, there is also commonly used a micro data representation that basically
embeds that tree of information in the HTML structure of the HTML email. Technically,
that's possible and still well defined, but it then basically puts HTML parsing into your
problem space with all the fun that that entails. Well, okay, so we implemented that as well.
Then we discovered a third variant of encoding that information, basically syntactically
invalid Jason. Comalist Jason is particularly popular, so we ended up adding workarounds for
the Jason parser to deal with all of that. Then we found the actually much bigger problem and
that is semantically incorrect data. I think the most extreme case was Air Berlin. They had the
arrival and departure times for flights in the local time zone of the airports, as you would
usually do it. But then they added the UTC offset of what is presumably their server location.
So if you travel to the US, eight hour difference, you probably noticed that something is wrong.
If you travel from here to Finland, a subtle one hour difference, super dangerous,
you're under risk of missing your flight. Another common problem, there's an address and there's
a geo-coordinate. They mismatch and not just by a few meters. We have to deal with that as well.
Then of course the other big problem, this is by far not as widely used as we would wish.
You find it with some airlines, some of the hotel and event booking platforms have it.
It's super rare for trains. I think in Europe it's only a train line. In general, on a scale from
Silicon Valley startup to 100 plus year old European national railway, it's clearly biased
towards the former. It seems to be even less common in Asia than in Europe.
That isn't really satisfying, but at that point we were hooked and we really wanted those features.
We started to look where else we could get them from. There's actually a lot of stuff that we
can extract data from in such emails. One particularly useful thing are flight and train ticket barcodes,
which then moves PDF parsing and image processing in our problem space. It gets worse.
That thing is an entire world on its own. I spoke a bit about that last year in the
railways and open transport deff room. I tried to skip that here. Another thing commonly found
on booking emails is Apple wallet parsers that zip files containing JSON. Parts of it is machine
readable. Parts of it is visual representation, but at least for location and time in the barcode.
That's a good starting point. Then of course there is the whole unstructured human readable part.
For some of that we were able to build generic extractors. Something like an airline boarding
pass. They might look very different from a visual and layout point of view, but they can all be
very reliably identified using the barcode. The barcode only contains very basic information,
like the day of travel, not the year or the time, and only the airport codes, but not the gate,
and so on. All of that information that is really relevant for you is in that human readable text
somewhere. It's possible to identify that and match it. For everything else we have
provider specific extractor script. That's usually a few lines of
JavaScript with regular expressions or X pass queries on the HTML. Not pretty, but it gets the
job done. With all of those ways of getting data out, we still have the problem that the data
quality isn't really on a level that we can work with. In particular we care about the very
exact time, including the time zone. By time zone I really mean IA and A time zone ID, not UTC offset,
because if you have a delay over a day-life saving time change, and yes that does happen,
then you really need the exact time zone to know when your new departure time is.
And the other aspect that is really important is the precise location. So as a geocoordinate,
that in turn also helps with determining the time zone, but we want to have features like
routing to your departure location or your hotel. And in order to improve on the input data,
we use some external data sources like OpenStreetMap or VickyData to resolve airport or train
station identifiers and get to the exact location. And we have a few things that apply domain knowledge.
For example, if you email, we first to a flight from Brussels to Stuttgart,
and mentions a flight time of about an hour. There's two airports with Brussels in the name.
They are both close to, or at least both of them are in Belgium, so we know the country
and time zone. There's also two airports with the name Stuttgart. One is in southern Germany,
the other one is somewhere in the US. But based on the flight time, we know exactly which one of that
is possible, right? And I may have uniquely identified the other airport and so on and so on.
And then in the end, we have some validation and plausibility checks because they're still
either incomplete or nonsense coming through, right? So if you would require time travel to
make that trip, then it's likely wrong somehow. And that's then how it looks like in the integration.
So we run the current email through the extractor. If it finds something, it shows a summary of that
and offers you to add that to your calendar or to your travel app on the phone. This is in KML.
Originally, the extractor started as a library for KML, but it's also available as a standalone
command line tool by now. That's how we did the integration in NextCloud. Same thing, right? We
showed a summary of what we found and you can add that to your calendar. There used to be a
Thunderbird plugin, but Thunderbird changed the integration API and since then that stalled a
bit. There's a lot of demand for that, so it would be nice to redirect that at some point.
And then there's of course the dedicated travel app, it's a memory that we built out of all of this
that Hans-Jörg had already mentioned, where you get a timeline of your trip and it then
fills the gaps with local public transport and looks for the weather forecast and reminds you
to bring a power plug converter if you're traveling to a country where you need that.
And I mean, that is exactly the kind of high-level semantic features and workflows that we can
build if we actually understand what you're dealing with in your emails or in your documents.
So if you produce any kind of transactional email, you most likely have a machine-needable
representation of what this is about, so please add that also to the email in some form, ideally in
the format Hans-Jörg is working on, but as you have seen, we are not particularly picky in
extracting, right? So anything that isn't regular expressions on human readable text would be
a big help already. And then finally, I haven't mentioned that yet, all of that of course runs
on your device, right? Unlike Google, Apple or TripIt, we don't read your email for this.
That on the other hand means we have not as many training samples as they have,
so we entirely rely on people donating us travel-related emails in some form, so
that is one way to help. Yeah, and that's it. Thank you.
Thank you.
All right, again, we have our number one question to ask today.
Do you have any statistics on signal to noise ratio? Essentially,
how many times is the information wrong? Do you kind of any reviews or testing in terms of like,
you say that incorrect information is better than no information, but does it ever get
confusing to a user, for example? I mean, we try very, very hard to detect stuff that is not plausible
or to fit out anything that we at least can detect. How much gets through that is not detectable and
then confusing. I don't actually know because the samples we have work or are filtered out,
but at least we don't get a lot of bug reports with I missed my flight because it showed something
wrong, and usually it is individual providers and they are consistently wrong, so we can add
workarounds for that to filter them out and not show anything for them, for example. But there is
the risk for providers that we don't know. If they send out something that we can't detect,
we might show you a wrong departure time, right? And that is a problem.
But you could, I know you could not log, instead of not showing the possibly wrong information,
you could not log it somewhere and then to make those statistics.
I mean, log in the way that we get the information. Yeah, because it's not a website.
That would go against the whole privacy idea that we are very...
But if, I don't know, if user agree to send those kind of...
We don't have like a data donation feature built into the app right now. That might be an interesting
option. But some people send this to us then manually, basically. Yeah.
I might, before I give some mic to Arndt, I might just comment on that because we talked already
also at Mark to people and there is a lot of the email senders, right? So, and in general,
there is some interest by them to support this in a way. So, I have a strong assumption, like,
if there is such faulty data, there might be ways to incentivize at least the big senders,
the big brands to do it right. So, I'm not so concerned about that.
Yeah.
Asking people to send bug reports is okay, but if you ever get a mail client to send something
to you, to log it, you're going to get information about people's sex life.
No matter what you try to get, you're going to get that.
It just happens, trust me. And then you have GDPR problems because, well, you thought it was
the name of an airline, but it actually was the name of a person. Yeah, I mean, that is,
I mean, that's one of the motivation why we are so focused on doing this locally and
with keeping control over this. Because, I mean, your personal travel is already quite sensitive.
But if you combine that with everybody else, the amount of patterns you see,
right, I mean, all of us travel to Brussels in the first weekend in February. If that happens once,
right, that could be by chance. But if it happens in the next year as well, and after two or three
times, that is not random, right? Then there is some relation between the people involved.
And that allows you to do some scary network analysis.
If you're looking for the structured data that's already there, it's the open travel alliance.
First it was in XML horror. Now it's in JSON. So maybe that will be, can be implemented in the
final structure. Open travel alliance. Yeah, I don't know that one yet. No, it's international.
Everything is in there, the planes, the trains, boats. Okay.
Yeah, we, from the scheme of the world stuff, we support flights, trains, buses,
events, restaurant reservations, and ferries and boats. Yeah. But there's certainly more that
can be done. One quick final question. I wanted to remark that the anonymization of data fields
is possible without being able to trace it back to an individual human being.
Because airlines are innumerable. So you can get to the proverbial shouts, whereas user names or
people's names are not. And so you could hash everything into the WAHOOZA and still recognize
whether or not you should have recognized the field differently than what you've actually
rendered in a client in this case. Right. Yeah, but anonymization has turned out to be
rather tricky on input data like PDFs, where we also rely on the proper structure. So as soon
as you start to modify this, it's not sure that the extractor still detects it in the same way.
And we often don't know what kind of sensitive information is even in there or what the fields
in the back would mean when we start with a new format. Right. So it's very hard to predict what
we need to strike out. Sure, yes. But I thought we were talking about the JSON.
Once we have the JSON, sure. But the JSON alone is not really enough to fix the extractor. We
need the source document in its original form without modification to see where it goes wrong
in the extraction. So if there is proper JSON in the source, then yes, then the JSON is enough.
But if our source is a PDF document attached to the email and the barcode in there, then
I need the full thing to debug why we failed the extractor. I'm interested, but we'll take this
offline, I suppose. Yeah. Yeah. Right. A short technical question is Bogo in the room. Ah,
right. There he is. Great. All right. So thank you very much for that lively discussion. Thank
you, Falka, for the presentation. Once applause again.