Hi, I'm Eemeli. I work for Mozilla. Yeah, so this is a talk I don't think I could really give anywhere else, except to an audience like the translations devroom here at FOSDEM. So I thought I would.

In my work at Mozilla on localization systems and tools and standards, I've recently ended up spending quite a bit of time participating in the Unicode Consortium's project to define MessageFormat 2, an evolution of the ICU MessageFormat standard, among other things. And I'm not here to talk about that specifically, but more about a side product of what we've ended up doing through that work, which is defining a data model for messages. In particular, messages that are not just a single segmented phrase that you've extracted and might be able to send to translation, but more dynamic content as well.

One of the interesting things about what we've ended up, not discovering exactly, but kind of stating the obvious about, is that there's an upper bound to what makes up a localizable segmented phrase, or message, really. It's limited by the keyword "localizable" there, because it's dealing with humans. Humans who need to understand it, but also translators who, for now at least, are still mostly humans, and who need to be able to take in the source message and produce from it an output that is understandable in their locale. And this ends up depending on a limited number of dimensions in which messages vary.

Variants I've kind of hidden in there as the first one, and of course spoiled everything by saying so. This is the way message content can vary depending on inputs like numbers and their pluralization categories (you have no apples, you have one apple, you have 15 apples) and gender-based determinants: grammatical gender, personal gender, all sorts of things in different locales and languages. But this is one dimension. If we can express that, hey, we've got this variance happening, this message depends on these input values, then that's a dimension we can express.

Then of course, once we have a single pattern, a single sequence, it might include placeholders. A placeholder might be the number n for how many apples you have, or it might be something entirely different.

Then finally, at least through the MessageFormat 2 work, we've determined that markup should be kept as a separate thing from placeholders. Markup here means something like HTML, but it doesn't need to be HTML. It can be any sort of syntax, any sort of indicator, saying that the content in this part of the message has these attributes, or this something about it.

Then within a placeholder, we can have values like numbers that we need to deal with, or plain strings coming from an external source. We can also have annotations on them. We can say that this number here, yeah, it's coming in as a number, but I want it to be formatted so that it has at least two fraction digits, for instance. This needs to be accounted for in how the whole message ends up being formatted.

Then, as I mentioned, we need to be able to deal with variables, because we are ultimately talking here about the scope of dynamic messages. So we need to be able to say explicitly that this message might be getting a number from the outside, it might be getting some string content, it might be getting anything as input, and it needs to deal with those.
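To make those dimensions a little more concrete, here is a small illustrative sketch in TypeScript (not part of the data model itself) showing how plural categories and a "format with at least two fraction digits" annotation map onto the standard ECMAScript Intl APIs:

```ts
// Illustrative only: the variance and annotation "dimensions" described above,
// expressed with standard Intl APIs.
const count = 15;

// Plural categories are what variant selection keys off of for numbers.
const category = new Intl.PluralRules('en').select(count); // 'other'

// An annotation like "format with at least two fraction digits" corresponds to
// formatter options applied to the placeholder's value.
const formatted = new Intl.NumberFormat('en', {
  minimumFractionDigits: 2,
}).format(count); // '15.00'

console.log(category, formatted);
```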
But sometimes, within a message, we also want to do a little bit of further processing on a variable value. We may want to select a certain part of it, capitalize it if we're talking about a string, do other sorts of transformations, or express the same value in multiple different ways within a message. So we need a little bit of tooling to deal with variables.

And that's it. Through the work on MessageFormat 2 over the past four years or so, we've not come up with effectively anything else that really is core to, that really drives, the qualities of a formattable message. And that's ended up meaning that one of the things we've produced out of this whole project is this data model for what a message looks like when you don't consider its syntax, when you consider it as a data structure.

I'm not going to go through all of this, but roughly speaking, we can say that a message has two different forms it can take. It can be just a single pattern, a single sequence that we're dealing with, or it can have some variance. That's the select message over there, which has some selectors that, when formatting, guide us towards selecting one of the variants of the message. The declarations let us declare the variables that exist for this message, whether they come in as input or are local to it. And then the variants and the catch-all key end up defining how exactly it works when we have multiple message variants.

When you get down to a single pattern, again, as I alluded to, it can obviously just contain literal content, a string, or it can have expressions, that is, placeholders, or it can have markup that can be opening or closing. We also included standalone markup there, so you can have an element, for example an inline image, expressed within a message. Then we have literals, variable references, and the annotations that I mentioned.

That's it. These two slides define the whole data model that we've ended up dealing with. Okay, I left some tiny little details out, like for example that an expression needs to have at least one of an argument or an annotation in order to be valid, and minor details like that. But that's it. This is, we think, a universal data model for representing messages.

And I'm here basically saying, hey, I think this is kind of cool. And this is not necessarily relevant just for the work specifically to do with MessageFormat 2, the syntax. It's more that this is effectively a data model that can allow us to separate the concerns around syntax: whether your messages are stored in gettext files, ICU MessageFormat, Fluent, literally any format, you can take that syntax and parse it into this data model structure representing a message.

And this is, I think, leading us to a world where we can consider more of a Unix philosophy, a separation of concerns, for what we do next. And I have, yes, cherry-picked explicitly the part of the Unix philosophy that says to do one thing and do it well, and not included, for instance, the part about making sure that you communicate values from one process to another as strings, because that doesn't necessarily work so well here. Because we need those parsers.
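Before moving on: here is a rough TypeScript sketch of the data-model shapes just described. It is simplified, and the names will not necessarily match the published MessageFormat 2 data model exactly; it is only meant to show how little is needed.

```ts
// A simplified sketch of the message data model described above.
// Field names loosely follow the MessageFormat 2 data-model draft; details are elided.
type Message = PatternMessage | SelectMessage;

interface PatternMessage {
  declarations: Declaration[]; // input and local variable declarations
  pattern: Pattern;            // a single sequence of content
}

interface SelectMessage {
  declarations: Declaration[];
  selectors: Expression[]; // their resolved values guide variant selection
  variants: Variant[];     // each variant is a complete pattern with its own keys
}

interface Variant {
  keys: Array<Literal | CatchallKey>; // the catch-all key '*' matches anything
  value: Pattern;
}
interface CatchallKey { type: '*' }

type Pattern = Array<string | Expression | Markup>;

// An expression (placeholder) must have at least an argument or an annotation.
interface Expression {
  arg?: Literal | VariableRef;
  annotation?: FunctionAnnotation;
}

interface Literal { type: 'literal'; value: string }
interface VariableRef { type: 'variable'; name: string }
interface FunctionAnnotation {
  type: 'function';
  name: string; // e.g. 'number' or 'string'
  options?: Record<string, Literal | VariableRef>;
}

interface Markup {
  type: 'markup';
  kind: 'open' | 'standalone' | 'close'; // e.g. an inline image as standalone markup
  name: string;
}

interface Declaration { type: 'input' | 'local'; name: string; value: Expression }
```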
And if we need to understand all of the structure in a message every time we work with it, we end up, for the most part, mixing up the syntax concerns with everything else we're doing with messages.

So, as ideas, some of the things you can do with this data model. If you can read and write from a syntax to this data model, and you can do this with multiple syntaxes, this is effectively an interface through which you can take messages from one syntax, turn them into this data model representation, and go from there to any other syntax. With caveats, but roughly.

Another thing is that we can build tooling on top of this. You can build a linter or a validator on top of the data model representation of messages, rather than any syntax representation. And this means that you can use the same validation for all messages, independently of what syntax they might be coming from.

And if you have these capabilities, it matters because many established localization systems right now are very much monolithic. They have expectations that this is the exact syntax, in these sorts of files, that is used for messages or resources; this is exactly how you deal with them; this is what gets included in your output or your program; and this is exactly how it works. But as we're defining here a data model that can be read from any of these syntaxes, it means that you can build a different sort of formatting or runtime on the same syntax. So you can start from where you are now and, if you want to change how you're doing localization, you don't necessarily need to change everything all at once. You can take just the formatting runtime, change that to work with the same messages you've got, and move on from there. Or vice versa, actually: you could change how you store your messages and still use the same runtime, because this brings in an ability to very effectively transform your messages from one syntax to another.

And when you're dealing with localization, you of course need to deal with translation. This means that you need to somehow present to your translators the messages they're working with. And if a translation tool or framework goes through the MessageFormat 2 data model, it means that you can build an interface for localizers where they don't need to know what the underlying syntax is for the placeholders, the variables, the markup, or anything else; they can be presented with the same thing for all syntaxes, which might make things a little bit easier for everyone.

So those are the ideas I came up with here for what the next steps could be, but mostly I'm here saying, hey, this is a cool thing, you should play around with it. For us, the current and ongoing work is to extend this sort of definition to also include message resources, and to include the sort of comments and metadata that is quite essential for communicating the context of a message to translation, which, as I'm kind of hoping some of you noticed, was completely left out of the earlier model. But that's intentional, so that we can separate these considerations from each other.

But that's it for me. Thank you very much for listening. I'd be very happy to have any questions or comments.

In another talk, I heard about MessageFormat 2 and function invocations. How do function invocations and the data model work together, how do they relate?

The question is how function invocations relate to all of this.
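As a sketch of what such syntax-agnostic tooling could look like, building on the types sketched earlier (the interfaces below are hypothetical and do not correspond to any existing library API):

```ts
// Hypothetical interfaces: what syntax-agnostic tooling on top of the data model
// could look like. Nothing here is an existing library API.
interface MessageSyntax {
  parse(source: string): Message;      // e.g. gettext, Fluent, ICU MessageFormat, MF2 syntax
  stringify(message: Message): string;
}

// Convert a message from one syntax to another via the shared data model
// (with caveats: not every feature round-trips cleanly).
function convert(from: MessageSyntax, to: MessageSyntax, source: string): string {
  return to.stringify(from.parse(source));
}

// A validator written against the data model applies to every syntax equally.
// Example check: every select message should include a catch-all variant.
function hasCatchallVariant(msg: Message): boolean {
  if (!('variants' in msg)) return true; // single-pattern messages are fine as-is
  return msg.variants.some(v => v.keys.some(k => k.type === '*'));
}
```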
And this, yes, function invocations are represented here in the function annotations. So something like plural selection, for example, could use a function with the name plural as the annotation in a select message's selector expression, which is there.

The next question was whether there is a set of built-in functions that are supported. MessageFormat 2 does come with a relatively small set of built-in functions. The data model itself does not presume this set absolutely, and the set of functions can be extended. For MessageFormat 2 in particular, we are looking at a very minimum of, effectively, number, which does plural and ordinal selection but also acts as a formatter, and then string, which is a sort of default for string formatting but also does the same sort of selection as the ICU MessageFormat select does. And we are still discussing, for MessageFormat 2, what other things to include here.

Now, of course, when representing messages coming from some completely other syntax, it is entirely conceivable that it is not directly possible to express those messages using the functions that MessageFormat 2 defines out of the box. But the data model does allow you to say that a function like this must be used here, and you can otherwise define how that function works, if that makes sense. And it's possible to make translations between these function meanings.

Anything more?

The reason to separate context from the minimum required, effectively, and here I'm jumping into the answer, is that the context is absolutely required for the translation work, but it is not absolutely required for formatting the message. So we need to be able to represent it, but we do not absolutely need it to be part of the message itself when it is being formatted. And this is why we are dealing with it slightly separately. They are very much related concerns, but with the data model we've tried to find the minimum required for representing a message, and when you trim down to the minimum, the context kind of ends up as a thing we can define externally, so we've chosen to do that. And if you're interested in that, in particular the specifics of what should be included in the sort of base set of metadata and context fields, here's an issue link where we're discussing this right now, and I would be very happy to have your input on it.

Anything more?

Regarding the translator tools: right now, most translator tools present a string and expect that the translator will write a string back. Do you imagine this will change, so that the translator will see the elements of the data model in a more graphical way and choose translations through combo boxes or something like that? Or do you think it will stay as a string representation for translators in the future?

I have no idea, and anything is possible, and that's kind of cool. So predicting the future of what the translator experience is going to be is, shall we say, a hard question. One thing I do think is that this sort of data model makes it easier to build tools that can present to a translator more of the options and opportunities they might have in modifying a message, and content like placeholders and markup, which might just show up as syntax when presented as a string, where it can be a challenge to even realize that I could change how this bit of it is styled.
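As an illustration of how a selector function in the spirit of the built-in number might behave at this level (this is a sketch of the idea, not the actual MessageFormat 2 function registry or its signatures):

```ts
// Sketch of a selector function: given a value and the variant keys,
// return the matching keys in preference order.
type SelectorFn = (locale: string, value: unknown, keys: string[]) => string[];

const numberSelector: SelectorFn = (locale, value, keys) => {
  const n = Number(value);
  const exact = keys.filter(k => k === String(n));          // exact matches such as '1'
  const category = new Intl.PluralRules(locale).select(n);  // 'one', 'few', 'other', ...
  const plural = keys.filter(k => k === category);
  return [...exact, ...plural];
};

// A function table like this is how a runtime could be extended with functions
// that are not built in; the names here are hypothetical.
const selectors: Record<string, SelectorFn> = {
  number: numberSelector,
};

console.log(selectors.number('en', 1, ['1', 'one', '*'])); // ['1', 'one']
```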
But if we can present interfaces that can read the data model and understand from it that, hey, hang on, this could be tweaked this way, then richer interfaces could be built. However, of course, we do need to keep in mind that in the vast majority of cases a message is just best represented as a pure string. So the majority of the work is not going to change, but the corner cases are where it gets interesting and challenging, and for those there might be opportunities to present messages in a more translator-friendly way.

And one part of this that I kind of skimmed over, and that was mentioned in Ujjwal's presentation on MessageFormat 2 yesterday, is that here the selection of variants is not an inline component, as it is for example in ICU MessageFormat or Fluent. Instead, all of the variants need to be complete patterns, presented at the top level of the whole message. This is entirely intended to guide towards structures that are easier for translators to deal with: rather than needing to figure out "you have", and then a selector of apples, you have a selector which has "you have one apple", "you have three apples", and so on, as sketched below.

But yeah, anything more? If not, I would like to thank you very much for your time, and yeah, that's it for me.
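For reference, a sketch of that contrast between inline selection and top-level variants, with the latter shown as a data-model structure (per the earlier type sketch) rather than in any particular syntax:

```ts
// ICU MessageFormat embeds the selector inline, inside the surrounding text:
const icu = 'You have {count, plural, one {# apple} other {# apples}}.';

// In the data model discussed here, selection sits at the top of the message and
// every variant is a complete pattern:
const selectMessage: SelectMessage = {
  declarations: [],
  selectors: [
    { arg: { type: 'variable', name: 'count' }, annotation: { type: 'function', name: 'number' } },
  ],
  variants: [
    {
      keys: [{ type: 'literal', value: 'one' }],
      value: ['You have ', { arg: { type: 'variable', name: 'count' } }, ' apple.'],
    },
    {
      keys: [{ type: '*' }],
      value: ['You have ', { arg: { type: 'variable', name: 'count' } }, ' apples.'],
    },
  ],
};
```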