Hi, I'm Eemeli. I work for Mozilla. Yeah, so this is a talk I don't think I could really give anywhere else, except to an audience like the translations devroom here at FOSDEM. So I thought I would.

In my work at Mozilla on localization systems and tools and standards, I've recently ended up spending quite a bit of time participating in the Unicode Consortium's project to define MessageFormat 2, an evolution of the ICU MessageFormat standard, among other things. And I'm not here to talk about that specifically, but more about a side product of what we've ended up doing through that work, which is defining a data model for messages. In particular, messages that are not just a single segmented phrase that you've extracted and might be able to send to translation, but more dynamic content as well.

One of the interesting things about what we've ended up, not discovering exactly, but kind of stating the obvious about, is that there's an upper bound to what makes up a localizable segmented phrase, or message, really. It's limited by the keyword "localizable" there, because it's dealing with humans. Humans who need to understand it, but also translators who, for now at least, are still mostly humans, and who need to be able to take in the source message and produce from it an output that is understandable in their locale. And this ends up depending on a limited number of dimensions in which messages vary.

Variants I've kind of hidden in there as the first one, and of course spoiled everything by saying so. This is the way message content can vary depending on inputs like numbers and their pluralization categories (you have no apples, you have one apple, you have 15 apples) and gender-based determinants: grammatical gender, personal gender, all sorts of things in different locales and languages. But this is one dimension. If we can express that, hey, we've got this variance happening, this message depends on these input values, then that's a dimension we can express.

Then of course, once we have a single pattern, a single sequence, it might include placeholders. A placeholder might be the number n for how many apples you have, or it might be something entirely different.

Then finally, at least through the MessageFormat 2 work, we've determined that markup should be kept as a separate thing from placeholders. Markup here means something like HTML, but it doesn't need to be HTML. It can be any sort of syntax, any sort of indicator, saying that the content in this part of the message has these attributes, or this something about it.

Then within a placeholder, we can have values like numbers that we need to deal with, or plain strings coming from an external source. We can also have annotations on them. We can say that this number here, yeah, it's coming in as a number, but I want it to be formatted so that it has at least two fraction digits, for instance. This needs to be accounted for in how the whole message ends up being formatted.

Then, as I mentioned, we need to be able to deal with variables, because we are ultimately talking here about the scope of dynamic messages. So we need to be able to say explicitly that this message might be getting a number from the outside, it might be getting some string content, it might be getting anything as input, and it needs to deal with those.
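To make those dimensions a little more concrete, here is a small illustrative sketch in TypeScript (not part of the data model itself) showing how plural categories and a "format with at least two fraction digits" annotation map onto the standard ECMAScript Intl APIs:

```ts
// Illustrative only: the variance and annotation "dimensions" described above,
// expressed with standard Intl APIs.
const count = 15;

// Plural categories are what variant selection keys off of for numbers.
const category = new Intl.PluralRules('en').select(count); // 'other'

// An annotation like "format with at least two fraction digits" corresponds to
// formatter options applied to the placeholder's value.
const formatted = new Intl.NumberFormat('en', {
  minimumFractionDigits: 2,
}).format(count); // '15.00'

console.log(category, formatted);
```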
But sometimes, within a message, we also want to do a little bit of further processing on a variable value. We may want to select a certain part of it, capitalize it if we're talking about a string, do other sorts of transformations, or express the same value in multiple different ways within a message. So we need a little bit of tooling to deal with variables.

And that's it. Through the work on MessageFormat 2 over the past four years or so, we've not come up with effectively anything else that really is core to, that really drives, the qualities of a formattable message. And that's ended up meaning that one of the things we've produced out of this whole project is this data model for what a message looks like when you don't consider its syntax, when you consider it as a data structure.

I'm not going to go through all of this, but roughly speaking, we can say that a message has two different forms it can take. It can be just a single pattern, a single sequence that we're dealing with, or it can have some variance. That's the select message over there, which has some selectors that, when formatting, guide us towards selecting one of the variants of the message. The declarations let us declare the variables that exist for this message, whether they come in as input or are local to it. And then the variants and the catch-all key end up defining how exactly it works when we have multiple message variants.

When you get down to a single pattern, again, as I alluded to, it can obviously just contain literal content, a string, or it can have expressions, that is, placeholders, or it can have markup that can be opening or closing. We also included standalone markup there, so you can have an element, for example an inline image, expressed within a message. Then we have literals, variable references, and the annotations that I mentioned.

That's it. These two slides define the whole data model that we've ended up dealing with. Okay, I left some tiny little details out, like for example that an expression needs to have at least one of an argument or an annotation in order to be valid, and minor details like that. But that's it. This is, we think, a universal data model for representing messages.

And I'm here basically saying, hey, I think this is kind of cool. And this is not necessarily relevant just for the work specifically to do with MessageFormat 2, the syntax. It's more that this is effectively a data model that can allow us to separate the concerns around syntax: whether your messages are stored in gettext files, ICU MessageFormat, Fluent, literally any format, you can take that syntax and parse it into this data model structure representing a message.

And this is, I think, leading us to a world where we can consider more of a Unix philosophy, a separation of concerns, for what we do next. And I have, yes, cherry-picked explicitly the part of the Unix philosophy that says to do one thing and do it well, and not included, for instance, the part about making sure that you communicate values from one process to another as strings, because that doesn't necessarily work so well here. Because we need those parsers.
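Before moving on: here is a rough TypeScript sketch of the data-model shapes just described. It is simplified, and the names will not necessarily match the published MessageFormat 2 data model exactly; it is only meant to show how little is needed.

```ts
// A simplified sketch of the message data model described above.
// Field names loosely follow the MessageFormat 2 data-model draft; details are elided.
type Message = PatternMessage | SelectMessage;

interface PatternMessage {
  declarations: Declaration[]; // input and local variable declarations
  pattern: Pattern;            // a single sequence of content
}

interface SelectMessage {
  declarations: Declaration[];
  selectors: Expression[]; // their resolved values guide variant selection
  variants: Variant[];     // each variant is a complete pattern with its own keys
}

interface Variant {
  keys: Array<Literal | CatchallKey>; // the catch-all key '*' matches anything
  value: Pattern;
}
interface CatchallKey { type: '*' }

type Pattern = Array<string | Expression | Markup>;

// An expression (placeholder) must have at least an argument or an annotation.
interface Expression {
  arg?: Literal | VariableRef;
  annotation?: FunctionAnnotation;
}

interface Literal { type: 'literal'; value: string }
interface VariableRef { type: 'variable'; name: string }
interface FunctionAnnotation {
  type: 'function';
  name: string; // e.g. 'number' or 'string'
  options?: Record<string, Literal | VariableRef>;
}

interface Markup {
  type: 'markup';
  kind: 'open' | 'standalone' | 'close'; // e.g. an inline image as standalone markup
  name: string;
}

interface Declaration { type: 'input' | 'local'; name: string; value: Expression }
```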
And if we need to understand all of the structure in a message every time we work with it, we end up, for the most part, mixing up the syntax concerns with everything else we're doing with messages.

So, as ideas, some of the things you can do with this data model. If you can read and write from a syntax to this data model, and you can do this with multiple syntaxes, this is effectively an interface through which you can take messages from one syntax, turn them into this data model representation, and go from there to any other syntax. With caveats, but roughly.

Another thing is that we can build tooling on top of this. You can build a linter or a validator on top of the data model representation of messages, rather than any syntax representation. And this means that you can use the same validation for all messages, independently of what syntax they might be coming from.

And if you have these capabilities, it matters because many established localization systems right now are very much monolithic. They have expectations that this is the exact syntax, in these sorts of files, that is used for messages or resources; this is exactly how you deal with them; this is what gets included in your output or your program; and this is exactly how it works. But as we're defining here a data model that can be read from any of these syntaxes, it means that you can build a different sort of formatting or runtime on the same syntax. So you can start from where you are now and, if you want to change how you're doing localization, you don't necessarily need to change everything all at once. You can take just the formatting runtime, change that to work with the same messages you've got, and move on from there. Or vice versa, actually: you could change how you store your messages and still use the same runtime, because this brings in an ability to very effectively transform your messages from one syntax to another.

And when you're dealing with localization, you of course need to deal with translation. This means that you need to somehow present to your translators the messages they're working with. And if a translation tool or framework goes through the MessageFormat 2 data model, it means that you can build an interface for localizers where they don't need to know what the underlying syntax is for the placeholders, the variables, the markup, or anything else; they can be presented with the same thing for all syntaxes, which might make things a little bit easier for everyone.

So those are the ideas I came up with here for what the next steps could be, but mostly I'm here saying, hey, this is a cool thing, you should play around with it. For us, the current and ongoing work is to extend this sort of definition to also include message resources, and to include the sort of comments and metadata that is quite essential for communicating the context of a message to translation, which, as I'm kind of hoping some of you noticed, was completely left out of the earlier model. But that's intentional, so that we can separate these considerations from each other.

But that's it for me. Thank you very much for listening. I'd be very happy to have any questions or comments.

In another talk, I heard about MessageFormat 2 and function invocations. How do function invocations and the data model work together, how do they relate?

The question is how function invocations relate to all of this.
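As a sketch of what such syntax-agnostic tooling could look like, building on the types sketched earlier (the interfaces below are hypothetical and do not correspond to any existing library API):

```ts
// Hypothetical interfaces: what syntax-agnostic tooling on top of the data model
// could look like. Nothing here is an existing library API.
interface MessageSyntax {
  parse(source: string): Message;      // e.g. gettext, Fluent, ICU MessageFormat, MF2 syntax
  stringify(message: Message): string;
}

// Convert a message from one syntax to another via the shared data model
// (with caveats: not every feature round-trips cleanly).
function convert(from: MessageSyntax, to: MessageSyntax, source: string): string {
  return to.stringify(from.parse(source));
}

// A validator written against the data model applies to every syntax equally.
// Example check: every select message should include a catch-all variant.
function hasCatchallVariant(msg: Message): boolean {
  if (!('variants' in msg)) return true; // single-pattern messages are fine as-is
  return msg.variants.some(v => v.keys.some(k => k.type === '*'));
}
```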
And this, yes, function invocations are represented here in the function annotations. So something like plural selection, for example, could use a function with the name plural as the annotation in a select message's selector expression, which is there.

The next question was whether there is a set of built-in functions that are supported. MessageFormat 2 does come with a relatively small set of built-in functions. The data model itself does not presume this set absolutely, and the set of functions can be extended. For MessageFormat 2 in particular, we are looking at a very minimum of, effectively, number, which does plural and ordinal selection but also acts as a formatter, and then string, which is a sort of default for string formatting but also does the same sort of selection as the ICU MessageFormat select does. And we are still discussing, for MessageFormat 2, what other things to include here.

Now, of course, when representing messages coming from some completely other syntax, it is entirely conceivable that it is not directly possible to express those messages using the functions that MessageFormat 2 defines out of the box. But the data model does allow you to say that a function like this must be used here, and you can otherwise define how that function works, if that makes sense. And it's possible to make translations between these function meanings.

Anything more?

The reason to separate context from the minimum required, effectively, and here I'm jumping into the answer, is that the context is absolutely required for the translation work, but it is not absolutely required for formatting the message. So we need to be able to represent it, but we do not absolutely need it to be part of the message itself when it is being formatted. And this is why we are dealing with it slightly separately. They are very much related concerns, but with the data model we've tried to find the minimum required for representing a message, and when you trim down to the minimum, the context kind of ends up as a thing we can define externally, so we've chosen to do that. And if you're interested in that, in particular the specifics of what should be included in the sort of base set of metadata and context fields, here's an issue link where we're discussing this right now, and I would be very happy to have your input on it.

Anything more?

Regarding the translator tools: right now, most translator tools present a string and expect that the translator will write a string back. Do you imagine this will change, so that the translator will see the elements of the data model in a more graphical way and choose translations through combo boxes or something like that? Or do you think it will stay as a string representation for translators in the future?

I have no idea, and anything is possible, and that's kind of cool. So predicting the future of what the translator experience is going to be is, shall we say, a hard question. One thing I do think is that this sort of data model makes it easier to build tools that can present to a translator more of the options and opportunities they might have in modifying a message, and content like placeholders and markup, which might just show up as syntax when presented as a string, where it can be a challenge to even realize that I could change how this bit of it is styled.
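As an illustration of how a selector function in the spirit of the built-in number might behave at this level (this is a sketch of the idea, not the actual MessageFormat 2 function registry or its signatures):

```ts
// Sketch of a selector function: given a value and the variant keys,
// return the matching keys in preference order.
type SelectorFn = (locale: string, value: unknown, keys: string[]) => string[];

const numberSelector: SelectorFn = (locale, value, keys) => {
  const n = Number(value);
  const exact = keys.filter(k => k === String(n));          // exact matches such as '1'
  const category = new Intl.PluralRules(locale).select(n);  // 'one', 'few', 'other', ...
  const plural = keys.filter(k => k === category);
  return [...exact, ...plural];
};

// A function table like this is how a runtime could be extended with functions
// that are not built in; the names here are hypothetical.
const selectors: Record<string, SelectorFn> = {
  number: numberSelector,
};

console.log(selectors.number('en', 1, ['1', 'one', '*'])); // ['1', 'one']
```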
But if we can present interfaces that can read the data model and understand from it that, hey, hang on, this could be tweaked this way, then richer interfaces could be built. However, of course, we do need to keep in mind that in the vast majority of cases a message is just best represented as a pure string. So the majority of the work is not going to change, but the corner cases are where it gets interesting and challenging, and for those there might be opportunities to present messages in a more translator-friendly way.

And one part of this that I kind of skimmed over, and that was mentioned in Ujjwal's presentation on MessageFormat 2 yesterday, is that here the selection of variants is not an inline component, as it is for example in ICU MessageFormat or Fluent. Instead, all of the variants need to be complete patterns, presented at the top level of the whole message. This is entirely intended to guide towards structures that are easier for translators to deal with: rather than needing to figure out "you have", and then a selector of apples, you have a selector which has "you have one apple", "you have three apples", and so on, as sketched below.

But yeah, anything more? If not, I would like to thank you very much for your time, and yeah, that's it for me.
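For reference, a sketch of that contrast between inline selection and top-level variants, with the latter shown as a data-model structure (per the earlier type sketch) rather than in any particular syntax:

```ts
// ICU MessageFormat embeds the selector inline, inside the surrounding text:
const icu = 'You have {count, plural, one {# apple} other {# apples}}.';

// In the data model discussed here, selection sits at the top of the message and
// every variant is a complete pattern:
const selectMessage: SelectMessage = {
  declarations: [],
  selectors: [
    { arg: { type: 'variable', name: 'count' }, annotation: { type: 'function', name: 'number' } },
  ],
  variants: [
    {
      keys: [{ type: 'literal', value: 'one' }],
      value: ['You have ', { arg: { type: 'variable', name: 'count' } }, ' apple.'],
    },
    {
      keys: [{ type: '*' }],
      value: ['You have ', { arg: { type: 'variable', name: 'count' } }, ' apples.'],
    },
  ],
};
```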